An important consideration in the design of high performance multiprocessor systems
is to ensure the correctness of the results computed in the presence of transient and
intermittent failures. Concurrent error detection and correction have been applied to such
systems in order to achieve reliability. Algorithm Based Fault Tolerance (ABFT) has
been suggested as a cost-effective concurrent error detection scheme. The research
reported in this thesis has been motivated by the complexity involved in the analysis and
design of ABFT systems. To that end, a matrix-based model has been developed and,
based on that, algorithms for both the design and analysis of ABFT systems are
formulated. These algorithms are less complex than the existing ones. In order to reduce the
complexity further, a hierarchical approach is developed for the analysis of large systems.



DEDICATION 


To my Parents and Brothers 
And to the memory of my Uncle K. N. Pillai 




ACKNOWLEDGEMENTS 


I am deeply grateful to my thesis advisor, Professor Jacob A. Abraham, for his
patient guidance and helpful suggestions. His encouragement, concern, and insight in
academic as well as nonacademic matters were invaluable sources of support throughout
the course of this work. I would also like to thank Professors Prithviraj Banerjee,
Ravishankar K. Iyer, W. Kent Fuchs, and C. L. Liu for being members of my dissertation
committee and for their time and support. I gratefully acknowledge Robert
Mueller-Thuns and Professor Daniel G. Saab for many interesting discussions and helpful
suggestions. The friendship of Madhav Desai, Rabindra Roy, Subhodev Das, and Abbas Butt
deserves special mention. I am also thankful to my colleagues and friends in the Center
for Reliable and High Performance Computing (CRHC) at the Coordinated Science
Laboratory. A big thank you to: Biju, Leena, James, Kunjumol, Thomas Panthaplam,
G’mon, Thomas, Abe, and Manoj Franklin for making me feel at home away from home.
Finally, I would like to thank my parents and brothers for their everlasting love and
support, which made this thesis a reality.

This research was supported by the National Aeronautics and Space Administration 
(NASA) under Contract NAG 1-613 at the University of Illinois. 





TABLE OF CONTENTS 


CHAPTER

1. INTRODUCTION
1.1. Fault-Tolerant Multiprocessor Systems
1.2. Concurrent Error Detection (CED)
1.3. Previous Research
1.4. Thesis Outline

2. ALGORITHM-BASED FAULT TOLERANCE
2.1. Introduction
2.2. General System Description
2.2.1. Faults and errors
2.2.2. The concept of (g, h) checks
2.3. Characteristics of ABFT
2.4. ABFT Techniques for Matrix Operations
2.4.1. Real-number codes for fault-tolerant matrix operations
2.4.1.1. General description of linear codes
2.4.2. Systematic codes
2.5. Conclusions

3. A MODEL FOR ALGORITHM-BASED FAULT TOLERANCE
3.1. Introduction
3.2. Graph Representation of a System
3.2.1. Detection and location of faults using the graph model
3.2.1.1. Conditions on fault detection
3.2.1.2. Conditions on fault location
3.2.2. Limitations of the graph-theoretic model
3.3. An Improved Matrix-Based Model
3.3.1. The model matrices
3.3.2. Physical significance of the model matrices
3.3.3. Check invalidation
3.4. Conclusions
3.4.1. Comparison between the graph model and the matrix model

4. ANALYTICAL APPLICATIONS OF THE MATRIX-BASED MODEL
4.1. Introduction
4.2. Fault Analysis of a System
4.3. Analysis for Fault Detectability
4.3.1. Algorithm to check whether R is completely detectable
4.4. Analysis for Fault Locatability
4.4.1. Physical significance of disagreement
4.5. Complexity of the Algorithms
4.6. Examples for the Applications of the Model
4.7. An Alternative Approach to Check Invalidation
4.7.1. Secondary analysis
4.7.1.1. Algorithm to check whether f is an STS
4.7.2. Analysis to determine actual locatability
4.8. Further Extensions
4.8.1. Description of the diagnostic algorithm
4.9. Results and Conclusions

5. DESIGN OF ABFT SYSTEMS
5.1. Introduction
5.2. Previous Work
5.2.1. A few sample bounds
5.2.2. Limitations
5.3. A New Approach for the Design of FTMP Systems
5.3.1. Problem definition
5.3.2. Construction of the actual system
5.3.3. Comparison with previous schemes
5.4. Conclusions
5.4.1. An alternative approach

6. HIERARCHICAL DESIGN AND ANALYSIS
6.1. Introduction
6.2. Independent and Orthogonal Checks
6.3. The Hierarchical Approach
6.3.1. Construction of a hierarchical system
6.3.2. The number of checks in the hierarchical system
6.3.3. Hierarchical analysis of systems
6.4. Conclusions

7. CONCLUSIONS
7.1. Summary of Results
7.2. Suggestions for Future Research

REFERENCES

VITA


LIST OF FIGURES 


Figure

1.1. Scope of this thesis
2.1. Matrix multiplication on a mesh-connected processor array
3.1. Graphical representation of the system in Example 3.1
4.1. Graphical representation of an example system
4.2. Example for error collapsing
4.3. Fault patterns of cardinality ...
4.4. The PC matrix of the hypothetical ...
4.5.
4.6.
4.7. Data rotation in the ...
5.1. Construction of a product system
5.2. Design of the final system from the product system
6.1.
6.2. Examples for unbounded and bounded ...
6.3. Hierarchical expansion of a ...
6.4. Hierarchical expansion of a ...
6.5. Hierarchical expansion of a ...
6.6. Unnecessary checks in the second level of hierarchy
6.7. The PC matrix of a hierarchical system


CHAPTER 1. 


INTRODUCTION 


1.1. Fault-Tolerant Multiprocessor Systems 

Multiprocessing has become a viable alternative to serial computing to meet the
high-performance requirements in various scientific, engineering, medical, military, and
basic research areas. High speed of computation, high throughput, large volumes of
processed data, and long periods of reliable operation are some of the common requirements
in most of these applications. With the help of modern VLSI technology, complex
processor chips containing up to $10^6$ transistors have been designed and marketed to meet
the high computation requirements.

Unfortunately, performance and reliability are two contradicting requirements. As 
the rate of computation increases, the probability of an error in the computed result also 
increases. There are various reasons for this. First of all, the complexity of the processor 
increases with its computation capability; it has been observed that the failure rate 
increases exponentially with the complexity of the chip [1]. Another observation in this 
regard is that as the computation and the communication load increase, the failure rate in 
the system also increases [2]. (Note that an increased computation rate has to be supple- 
mented with increased communications between the processors.) 
Long periods of reliable computing are necessary in areas such as medical
instrumentation, where a failure may lead to fatalities. Another scenario is one in which the
system is inaccessible for repair; for instance, a space satellite, unattended after its launch, is
expected to deliver accurate data from space for a long period of time. To meet these
acute reliability requirements, the computer should be able to withstand failures.

Two methods have been suggested for handling failures in an electronic system: 
fault avoidance and fault tolerance [3]. In fault avoidance, the system tries to evade 
faults by design as well as by protection against fault inducing environments. However, 
it is applicable only when there is an a priori knowledge of all the possible faults. Quite 
often that is not the case. Furthermore, the cost involved in fault avoidance techniques is 
high. Therefore, fault tolerance has been accepted as the cost effective choice. 

Two approaches to achieve fault tolerance have been the static or masking redun- 
dancy techniques and the dynamic redundancy techniques. In the former, failures are 
tolerated by masking their effects; triplication and voting [4], duplication and comparison 
[5], and quadded logic [6] are some examples. In the dynamic redundancy approach, first 
the presence of a fault is detected and then a corrective action is taken in the form of 
replacing the failed unit, recomputing the result, or reconfiguring the system to isolate the 
faulty module from the rest of the system. Systems with dynamic redundancy are pre- 
ferred to systems with static redundancy due to their greater mean lifetime gains, greater 
isolation against catastrophic faults, ability to survive until all spares are exhausted, and 
their potential to utilize the lower failure rate of the redundant (usually unpowered) unit. 
However, the fault tolerance capabilities of the system are highly dependent on the qual- 
ity of the fault detection and recovery schemes. 
Various recovery schemes, especially reconfiguration schemes, have been studied
extensively in the past [7, 8, 9, 10]. The area of fault detection has received less attention.
One observes that fault detection is a more difficult problem than the reconfiguration
problem. With the potential of microelectronic technology to provide more redundant
processing nodes along with sophisticated switching networks interconnecting them,
reconfiguration and replacement have become less complex issues. In contrast, detection
of a fault in the system has become all the more complicated due to the complex
interaction between the component processors. In order to harness fully the fault tolerance
potential of modern VLSI architectures, one must have efficient and high-quality fault
detection schemes. The main theme of discussion in this thesis is the detection of faults
in multiprocessor systems.

A fault can be detected either by off-line checking or by concurrent checking. In
the first method, the system is brought off-line and checked for the presence of faults.
Even though this approach has the advantage that it does not affect the real-time
performance of the system, its application is limited since it can detect only permanent faults.
Unfortunately, studies show [11] that more than 85% of major system failures are
transient in nature. Furthermore, a strong relationship has been observed between the
occurrence of transients and the level of system activity. Therefore, it becomes
imperative to check for faults in a system while it is in operation. The current trend is to include
Concurrent Error Detection (CED) capability in the design of digital systems.

1.2. Concurrent Error Detection (CED)

Traditionally, systems with CED are implemented using self-checking circuits [12] 
or by hardware duplication and comparison of their results [5]. Self-checking circuits are 
specially designed to operate on data elements encoded using error-detecting codes. 
Duplication of circuits can be considered as a special type of self-checking circuits that 
employ the duplication code. Since these traditional techniques require 200 to 300% 
hardware redundancy, they are usually very expensive. This puts the pressure on the sys- 
tem designer to come up with cost effective schemes. 

The quality of CED techniques depends heavily upon the level at which checking is 
implemented: the gate, functional, or system level. Gate-level techniques such as those
using error detecting/correcting codes usually assume the conventional stuck-at fault 
model. Studies show, however, that there are faults which cannot be covered by the 
stuck-at fault model [13]. Further, due to the shrinking device dimensions, a physical 
defect affecting a small local area of a chip can result in faults in several gates. This 
points to the need for a higher-level fault model instead of the stuck-at fault model. 

Algorithm-based fault tolerance (ABFT), proposed by Huang and Abraham [14], is 
a fault tolerance scheme that uses CED techniques at a functional level. System level 
applications of ABFT techniques have also been investigated [15]. These techniques 
assume a general fault model which allows any single module in the system to be faulty 
[14]. Even though the faults are modeled at a high level, they cover all the lower-level 
stuck-at faults; also, the techniques are independent of the logic design and the type of the
IC used.

ABFT is widely applicable and it has proved its cost-effectiveness especially when
applied to array processors [16]. A detailed description of ABFT techniques may be
found in Chapter 2. The objective of this thesis is to develop efficient analysis and
design algorithms for ABFT systems.

1.3. Previous Research

The problem of locating faulty processors within a multiple processor system by
temporarily halting normal operation and placing the system in a diagnostic mode was
originally studied using the PMC model [17], which assumes that the processors can
individually test other processors. A test may be any sort of check by one processor on
the operation of some other, including applying test vectors and checking the resulting
outputs. On the basis of the test responses, the test outcome is classified as "pass" or
"fail." The test evaluation is always accurate if the testing unit is fault-free.

The PMC model is limited to systems in which each unit alone can test some other
units; also, different failure rates for the units in the system are not characterized. Russell
and Kime generalized this model by broadly interpreting the concepts of faults and tests
[18, 19]. In their model, a complete test of a unit requires the combined operation of more
than one unit. An algebraic approach to digital system fault diagnosis was suggested by
Adham and Friedman [20]. Here, a set of fault patterns is described by a Boolean
expression. To be applied to large systems, this approach requires tools for efficiently
manipulating Boolean expressions containing a large number of variables. Another
generalization of the PMC model has been suggested by Maheswari and Hakimi [21]. Their
model incorporates the probabilistic nature of fault occurrence. This model was further
extended by Fujiwara and Kinoshita [22].

The analysis of ABFT systems is much harder than the analysis of systems con- 
sidered in the above mentioned studies. In the PMC model and in its generalizations, 
researchers assume that complete tests are available for individual processors [18, 19]. 
That is, if the tested unit is faulty and the tester is fault-free, then the test is guaranteed to 
fail. However, in systems using ABFT, a particular fault pattern can produce a number 
of different error patterns. The checking operations detect the errors directly and the 
faults indirectly. Since the error detectability of the checks is finite, even if the check 
evaluating processor is fault-free, a fault in the checked unit may be undetected if the 
number of errors caused by that fault is larger than the error detectability of the check. 
(We denote these kinds of checks as incomplete checks.) Therefore, fault analysis in 
such systems is much more complex than conventional fault analysis. Note that the
class of systems using incomplete checks is a superset of the class of systems using
complete checks, because a complete check can be viewed as an incomplete check
with infinite error detectability.

The first attempt towards modeling ABFT systems was made by Banerjee and
Abraham [23], who proposed a graph-theoretic model. In this model, the system is represented
by a tripartite graph having three groups of nodes: nodes of type F corresponding to the
possible faulty processors, nodes of type E corresponding to the output data elements on
which the errors may occur, and nodes of type C corresponding to the checks. There is an
edge from an F node i to an E node j if data element $d_j$ is affected by processor $P_i$. There
is an edge from node j of type E to node k of type C if the data element $d_j$ is checked by
check $c_k$. For the analysis of faults in the system, a generalized error table (GET array) is
constructed from the graph model [23]. The GET array contains all possible error
combinations of the faults under consideration. The detectability or locatability of a fault is
determined by observing whether all the error patterns produced by that fault are detected
or located by the checks provided.
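
The tripartite structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-processor system, its data elements, and its checks are hypothetical and are not taken from [23], and the helper names `checks_reached` and `uncovered_processors` are invented for this sketch.

```python
# Hypothetical tripartite graph: F nodes (processors), E nodes (data), C nodes (checks).
# Edge F_i -> E_j: data element d_j is affected by processor P_i.
fe_edges = {
    "P1": {"d1", "d2"},
    "P2": {"d2", "d3"},
    "P3": {"d4"},
}
# Edge E_j -> C_k: data element d_j is checked by check c_k.
ec_edges = {
    "d1": {"c1"},
    "d2": {"c1", "c2"},
    "d3": {"c2"},
    "d4": set(),  # d4 is not covered by any check
}


def checks_reached(processor):
    """Return the checks reachable from a processor through its data elements."""
    reached = set()
    for d in fe_edges[processor]:
        reached |= ec_edges[d]
    return reached


def uncovered_processors():
    """Processors none of whose data elements is checked by any check."""
    return sorted(p for p in fe_edges if not checks_reached(p))
```

Reaching a check is only a necessary condition for detecting a fault. Whether every error pattern of a fault is actually detected is what the GET array analysis has to establish.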

Even though this model can be used for the accurate analysis of systems using 
ABFT, it has some limitations. The complexity of the analytical algorithms based on this 
model is exponential in the number of data elements in the system. This leads to enor- 
mous memory and time requirements. Inefficient handling of invalidation of checks, per- 
formed by faulty processors, is another drawback of the model. However, the model 
gives a theoretical framework for representing fault-tolerant systems. 

1.4. Thesis Outline 

This thesis is organized in the following way. A detailed description of ABFT sys- 
tems is given in Chapter 2. A general description of the multiprocessor systems which 
are candidate architectures for the application of ABFT is provided. The concept of (g, 
h) checks is discussed and examples are given. We consider fault-tolerant matrix multi- 
plication in detail and derive a general set of real-number codes for fault-tolerant matrix 
operations on processor arrays. 

In Chapter 3, we first briefly describe the graph-theoretic model. Then we present
the new matrix-based model. In this model, the relationships among processors, data,
and the checking operations are represented in terms of three matrices: the PD matrix, the
DC matrix, and the PC matrix. The physical significance of the model matrices is
explained with examples. The problem of invalidating the checks performed by faulty
processors is transformed into a problem of error detection at the output of the faulty
processor. This eventually reduces the complexity of the analysis algorithms.

Based on this model, algorithms are developed for determining the fault detectabil- 
ity and locatability of ABFT systems. Unlike the algorithms based on the graph model, 
these algorithms do not need exhaustive enumeration of errors in order to analyze the 
system completely; instead, we propose an error collapsing technique which reduces the 
complexity of the analytical algorithms from exponential to linear in the number of data 
elements, and polynomial in the number of processors. Application of these algorithms 
for the analysis of ABFT systems is illustrated with some realistic examples. Finally we 
propose an alternative method for the invalidation of checks performed by faulty proces- 
sors. 

Chapter 5 deals with the design of ABFT systems. We propose a straightforward
methodology for designing such systems. The advantage of this technique is that it can
handle error detectability and locatability simultaneously. Also, when the processors in
the system are producing large volumes of data, the new technique results in a smaller
number of checks when compared to those for the existing algorithms.

Even though the complexities of the analysis algorithms are less than the
complexities of the previous algorithms [23], the computation may require a large amount of time
and memory when the system has a large number of processors producing huge volumes
of data. In contrast, a hierarchical approach will reduce the complexity of the algorithms
to a polynomial in the logarithm of the number of processors in the system. In Chapter 6 we
illustrate a particular hierarchical approach to build large fault-tolerant multiprocessor
systems. Based on this approach a hierarchical analysis procedure is outlined.

In Chapter 7 we give a summary of the results in the thesis. Finally, some pointers 
are given towards future research in the related area. In order to make it easy for the 
reader to place the thesis in the vast area of reliable computing, a relational tree diagram 
is shown in Figure 1.1. The area enclosed in the dotted rectangle represents the area 
covered in this thesis. Even though the figure suggests that the analysis and design tech- 
niques developed in this thesis are pertinent to ABFT systems, it should be noted that 
these techniques are applicable to other types of fault-tolerant systems as well.


Reliable Computing
    Fault Avoidance
    Fault Tolerance
        Dynamic Redundancy
            Detection
                Off-line
                Concurrent
            Recovery
        Static Redundancy

Figure 1.1. Scope of this thesis.

CHAPTER 2.

ALGORITHM-BASED FAULT TOLERANCE 


2.1. Introduction 

As discussed in the preceding chapter, fault detection and diagnosis are integral 
parts of any fault tolerance scheme. There are two ways to detect faults: (1) by off-line 
checking and (2) by concurrent checking. In an off-line checking scheme, the computer 
(processor) is checked for its correctness while it is not performing any useful computa- 
tion. This approach has the advantage that the performance of the computer will be unaf- 
fected by the checking operation; however, this kind of checking can detect only per- 
manent faults. Transient faults, which constitute 75-80% of faults in a computer system 
[11], will not be detected by off-line checks. In order to detect transient faults, con- 
current error detection schemes such as duplication and comparison have been suggested. 
These schemes suffer from 200-300% hardware or time redundancy. In many applica- 
tion areas this amount of overhead is unaffordable. This motivated researchers to 
develop new schemes that require less overhead. 

A concurrent error detection scheme called algorithm-based fault tolerance (ABFT) 
has been suggested by Huang and Abraham for attaining the above objectives [14]. In 
ABFT the input data elements are encoded in the form of error detecting or correcting 
codes. The original non-fault-tolerant algorithm is modified to operate on encoded data 
and produce encoded outputs, from which useful information can be recovered easily.
The modified algorithm will take more time to operate on the encoded data when com- 
pared to the original algorithm, and this time overhead must not be excessive. The task 
distribution among the processing elements is done in such a way that any malfunction in 
a processing element will affect only a small portion of the data, which can be detected 
and corrected using the properties of the encoding. 

It has been observed that ABFT techniques are very cost effective when applied to 
processor arrays. In this chapter we give a general description of systems which are 
good candidates for the application of ABFT. The concept of algorithm based fault toler- 
ance will be illustrated with some application examples. 

2.2. General System Description 

In this section, we describe the general features of multiprocessor systems which are 
candidate architectures for the application of ABFT techniques. It may be noted, how- 
ever, that the application of ABFT techniques is not limited to multiprocessor systems; 
they are also applicable to algorithms running on uniprocessors, probably with less 
efficiency. 

An algorithm executing on a multiple processor system is specified as a sequence of
operations performed on a set of processors in some discrete time steps. Each processor
has a local memory on which it can perform reads and writes. It can also communicate
with other processors in the system through buffers at various input and output ports. A
processor cannot read from or write to any other processor's local memory, even in the
presence of a fault. This is not an unrealistic assumption since most of the existing
fault-tolerant multiprocessor systems are of the message-passing type rather than the
shared-memory type. This is because in a shared-memory architecture, error confinement is
difficult and often impossible. However, the concept of distributed shared virtual memory
has been developed to support shared-memory programming models in loosely coupled
distributed multiprocessor systems [24]. From a hardware point of view, these architectures
have the advantages of a distributed-memory parallel machine, whereas from a software
point of view they have additional advantages such as ease of process migration, ease
of passing complex data structures among processors, and ease of object synchronization
in object-oriented systems. Error recovery in such systems is described in [25]. In this
thesis we deal exclusively with machines using the message-passing paradigm for
communication among the processors.

2.2.1. Faults and errors 

A fault is any condition that causes a malfunction in a single processor while
performing operations. Some of the major causes of faults are: (1) manufacturing
defects such as photolithography errors, deficiencies in process quality, and improper
designs; (2) wear-out in the field due to electromigration, hot electron injection, etc.; and (3)
environmental effects such as alpha particles and cosmic radiation [26, 27]. The
manifestations of these faults are called errors [28].

An error is any discrepancy between the expected result of an operation and the
actual result of the operation. Since a processor performs different types of operations, a
fault in the processor may result in errors in any of those operations. For example, if the
processor is performing some data computation, a fault in the processor may produce a
wrong value of the data. If the processor is trying to read an address location, a fault may
cause wrong address selection (addressing fault). However, certain types of faults may
not produce any error at all.

Algorithm-based fault tolerance schemes are based on functional fault models that 
allow any single module in the system to be faulty [14]. Even though the faults and 
errors are treated at a high level, the model covers all the stuck-at faults and the 
corresponding errors in the lower gate and circuit levels. In addition, the model is 
independent of the type of design or technology used in the IC. In summary, we assume 
Byzantine type of faults [29]. 

In order to detect the presence of a fault in a processor, we resort to a technique 
called data value checking [30]. Here, a fault is detected by detecting errors in the final 
data value generated by the processor. One observes that the problem of detection of 
various faults such as addressing faults can be translated to the problem of detecting 
errors in the computed results [31]. Therefore, all the faults are treated uniformly as 
those corrupting the final, computed result. 

On the other hand, if a particular fault does not necessarily produce any errors in the 
final data value computed by that processor, we may disregard the presence of that fault. 
The computed result of a processor may be checked by one or more other processors in 
the system. Processors which check the output of one or more processors are called 
check evaluating processors or, in short, check processors. The evaluation of a fault in a 
check processor can also be translated to the problem of error detection at the output of 
that processor as we show in Chapters 3 and 4. 
We assume that any processor in the system is capable of performing useful
computations, check evaluation, or both. A check on the data element is any combination of
hardware and software procedures performed on the data by processors which use the
encoding of the data to generate a "pass" or "fail" output.

Let q be the total number of checks that are applied on the data to perform the
system-level checking, and let $C = \{c_1, c_2, \ldots, c_q\}$ denote the set of checks. Let n be the total
number of data and pseudo-data elements, and let $X = \{e_1, e_2, \ldots, e_n\}$ be the set of errors in the
data and pseudo-data elements. The set of error patterns, $E = \{E^1, E^2, \ldots, E^{2^n}\}$,
consists of all subsets of $X$. Let N be the number of processors in the system,
which includes both the processors performing useful computations and the
processors performing the evaluation of the checks. Faults in the processors can be
denoted by the set $v = \{f_1, f_2, \ldots, f_N\}$, where $f_i$ denotes a fault in processor $P_i$. The set
$F = \{F^1, F^2, \ldots, F^{2^N}\}$ consists of all subsets of $v$, and each fault pattern, $F^k \in F$, is permissible
in the system. Fault patterns consisting of t or fewer faults are called t-faults.

DEFINITION 2.1. DATA($P_i$) is the set of data elements affected by processor $P_i$. □

DEFINITION 2.2. CHECK($d_i$) is the set of checks that evaluate the correctness of
the data element $d_i$. □
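
The DATA and CHECK sets and the notion of t-faults can be made concrete with a short Python sketch. The system below is hypothetical, and the function names `t_faults`, `data_of_pattern`, and `num_error_patterns` are assumptions introduced only for illustration.

```python
from itertools import combinations

# Hypothetical system: DATA[P] is the set of data elements affected by processor P,
# CHECK[d] is the set of checks that evaluate data element d.
DATA = {"P1": {"d1", "d2"}, "P2": {"d3"}, "P3": {"d3", "d4"}}
CHECK = {"d1": {"c1"}, "d2": {"c1"}, "d3": {"c2"}, "d4": {"c2"}}


def t_faults(processors, t):
    """Enumerate all non-empty fault patterns with t or fewer faulty processors."""
    procs = sorted(processors)
    for size in range(1, t + 1):
        for pattern in combinations(procs, size):
            yield frozenset(pattern)


def data_of_pattern(pattern):
    """Union of DATA(P_i) over the faulty processors in the pattern."""
    affected = set()
    for p in pattern:
        affected |= DATA[p]
    return affected


def num_error_patterns(pattern):
    """Number of subsets of the affected data elements (including the empty one)."""
    return 2 ** len(data_of_pattern(pattern))
```

The last function shows why exhaustive enumeration of error patterns becomes costly: the count doubles with every additional data element a fault pattern can affect.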

2.2.2. The concept of (g, h) checks 

Formally, a (g, h) check is one which is defined on g data elements, $d_1, d_2, \ldots, d_g$,
and evaluated by a check-evaluating processor such that

(1) the check passes (outputs 0) if

(1.1) the check-evaluating processor is not faulty, and none of the data elements
is in error;

(2) the check fails (outputs 1) if

(2.1) at least one data element is erroneous, the number of erroneous elements
among the g data elements does not exceed h, and the check-evaluating processor
is not faulty;

(3) the check is invalid (may output 0 or 1) if either

(3.1) more than h data elements are erroneous, or

(3.2) the check-evaluating processor is faulty.

The variable h is referred to as the error detectability of the check.
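
The three conditions above can be summarized as a small decision function. This is a minimal sketch: the function name `gh_check` and the choice to return the strings "pass", "fail", and "invalid" (rather than the 0/1 outputs of a real check) are assumptions made so that the invalid case is visible.

```python
def gh_check(g, h, erroneous, evaluator_faulty=False):
    """Classify the outcome of a (g, h) check.

    g: number of data elements the check is defined on
    h: error detectability of the check
    erroneous: number of erroneous elements among the g data elements
    evaluator_faulty: True if the check-evaluating processor is faulty
    """
    if not 0 <= erroneous <= g:
        raise ValueError("erroneous must lie between 0 and g")
    if evaluator_faulty or erroneous > h:
        return "invalid"  # conditions (3.1) and (3.2): output may be 0 or 1
    if erroneous == 0:
        return "pass"     # condition (1.1): outputs 0
    return "fail"         # condition (2.1): 1 <= erroneous <= h, outputs 1
```

For instance, `gh_check(4, 2, 3)` yields "invalid": three errors exceed the detectability h = 2, so the check may pass even though the evaluating processor is fault-free.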

Note that these checks are different from the complete checks defined in [17, 19]. In
those works, the authors assume that whenever a checked unit is faulty and at least one of
the checking units is fault-free, the fault in the checked unit will always be detected. The
(g, h) checks are incomplete in this sense. In other words, even when all the checking
units are fault-free and the checked unit is faulty, the fault may go undetected. Condition
3.1 covers this possible incompleteness of (g, h) checks: even if the
check-evaluating processor is fault-free, it may not detect a fault in another processor if
the number of erroneous data elements checked by that processor exceeds h. We
illustrate another important property of (g, h) checks in the following example.

EXAMPLE 2.1. Consider a check C which tests the equality of n data elements, 
all of which are equal when they are correct. Since the checking operation is done on n 
data elements, g = n. Any error on up to n - 1 data elements will be detected by the check. 
However, if the error occurs on all n data elements in such a way that the resulting numbers 
are still the same, the check will not detect that error. Therefore, the error detectability h of 
the check is n - 1. It may be noted that, even though the check can detect multiple 
errors, it cannot locate an error. In general, this is an important distinction 
between (g, h) checks and error detecting/correcting codes such as Hamming codes, 
where an error detectability of t implies an error correctability (locatability) of 
$\lfloor t/2 \rfloor$. 
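
The behaviour of the equality check in Example 2.1 can be reproduced with a short sketch. The function `equality_check` and the sample values are assumptions made for illustration only.

```python
def equality_check(values):
    """Return 0 (pass) if all n values are equal, 1 (fail) otherwise."""
    return 0 if all(v == values[0] for v in values) else 1


correct = [7, 7, 7, 7]           # n = 4, no errors
three_wrong = [7, 9, 9, 9]       # n - 1 = 3 erroneous elements: detected
all_wrong_alike = [9, 9, 9, 9]   # all n elements wrong in the same way: undetected

results = [equality_check(v) for v in (correct, three_wrong, all_wrong_alike)]
```

The third case passes although every element is wrong, which is why the error detectability is n - 1 rather than n. In none of the cases does the check say which elements are erroneous.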



Having described the general features of a system supporting algorithm-based fault 
tolerance, we will present the salient features of ABFT techniques and illustrate them 
with some application examples. 


2.3. Characteristics of ABFT 

This technique is distinguished by three characteristics: 

(1) Encoding the input data stream. 

(2) Redesign of the algorithm to operate on the coded data. 

(3) Distribution of the additional computational steps among the various computational 
units in order to exploit maximum parallelism. 

The input data are encoded in the form of error detecting or correcting codes. The 
modified algorithm operates on the encoded data and produces encoded data output, from 
which useful information can be recovered very easily. Obviously, the modified algo- 
rithm will take more time to operate on the encoded data when compared to the original 
algorithm; this time overhead must not be excessive. The task distribution among the 
processing elements should be done in such a way that any malfunction in a processing 
element will affect only a small portion of the data, which can be detected and corrected 
using the properties of the encoding. 

Signal processing has been the major application area of ABFT until now, even 
though the technique is applicable in other types of computations as well. Since the 
major computational requirements for many important real-time signal processing tasks 
can be formulated using a common set of matrix computations, it is important to have 
fault tolerance techniques for various matrix operations [32]. Coding techniques based 
on ABFT have already been proposed for various computations such as matrix operations 
[14,33], FFT [34], QR factorization, and singular value decomposition [35]. Real- 
number codes such as the Checksum [14] and Weighted Checksum codes [16] have been 
proposed for fault-tolerant matrix operations such as matrix transposition, addition, mul- 
tiplication and matrix-vector multiplication. Application of these techniques in processor 
arrays and multiprocessor systems has been investigated by various researchers 
[36, 15,37]. In order to illustrate the application of ABFT techniques, we discuss fault- 
tolerant matrix operations in detail. We present some previous results in the area and 
then present some new results related to encoding schemes for fault-tolerant matrix 
operations. 

2.4. ABFT Techniques for Matrix Operations 

As mentioned in the preceding chapter, various methods such as checksum encod- 
ing, weighted checksum encoding and average checksum codes have been proposed for 
fault-tolerant matrix operations. These encoding schemes are especially suitable for 
computations in processor arrays [38]. 

EXAMPLE 2.2. Consider multiplying two 2x2 matrices A and B. 

\[
A = \begin{bmatrix} 2 & 1 \\ -1 & 0 \end{bmatrix}, \qquad
B = \begin{bmatrix} 1 & 0 \\ 3 & 2 \end{bmatrix}
\]

We append an additional row (checksum row) to matrix A and an additional column 
(checksum column) to matrix B. Now the product of these appended matrices will have 
an additional row and an additional column that satisfy the checksum property. 

\[
\begin{bmatrix} 2 & 1 \\ -1 & 0 \\ 1 & 1 \end{bmatrix}
\times
\begin{bmatrix} 1 & 0 & 1 \\ 3 & 2 & 5 \end{bmatrix}
=
\begin{bmatrix} 5 & 2 & 7 \\ -1 & 0 & -1 \\ 4 & 2 & 6 \end{bmatrix}
\]

The implementation of this multiplication on a mesh-connected processor array is 
shown in Figure 2.1. Here the encoded A matrix is broadcast among the processors in 
the horizontal direction and the encoded B matrix is broadcast vertically as shown in the 
figure. The resultant matrix entries are shown within the rectangles representing the 
processors. It has been shown that this kind of computational setup can detect three 
simultaneous faults or locate a single fault in the array. □ 
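
The checksum property of the product, and the way a single erroneous entry is located by the intersection of a failing row check and a failing column check, can be sketched as follows. This is a sequential illustration only, not the processor-array implementation; the helper names are assumptions, and exact integer comparison is used, whereas floating-point data would need a tolerance.

```python
def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def add_checksum_row(A):
    """Append a row holding the column sums of A."""
    return A + [[sum(col) for col in zip(*A)]]


def add_checksum_col(B):
    """Append a column holding the row sums of B."""
    return [row + [sum(row)] for row in B]


def failed_checks(C):
    """Return (rows, cols) whose checksum does not hold in a full checksum matrix C."""
    n_rows, n_cols = len(C) - 1, len(C[0]) - 1
    bad_rows = [i for i in range(n_rows + 1) if sum(C[i][:n_cols]) != C[i][n_cols]]
    bad_cols = [j for j in range(n_cols + 1)
                if sum(C[i][j] for i in range(n_rows)) != C[n_rows][j]]
    return bad_rows, bad_cols


A = [[2, 1], [-1, 0]]
B = [[1, 0], [3, 2]]
C = matmul(add_checksum_row(A), add_checksum_col(B))

corrupted = [row[:] for row in C]
corrupted[0][1] += 7          # inject a single error into entry (0, 1)
located = failed_checks(corrupted)
```

With the matrices of Example 2.2, `C` is the 3x3 full checksum matrix, and the injected error makes exactly one row check and one column check fail, which pinpoints the erroneous entry.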

The use of the checksum codes is limited due to the inflexibility of the encoding 
schemes and also due to potential numerical problems. Numerical errors may also be 
misconstrued as errors due to physical faults in the system. A generalization of the exist- 
ing schemes has been suggested as a solution to these shortcomings [39]. In order to 
complement those results, we prove that for every linear code defined over a finite field, 
there exists a corresponding linear real-number code with similar error detecting and 
correcting capabilities. 

Figure 2.1. Matrix multiplication on a mesh-connected processor array. 


2.4.1. Real-number codes for fault-tolerant matrix operations 

Real-number codes are codes defined over the field of real numbers. This is a 
high-level encoding scheme. In this section, we develop a general set of real-number codes 
for fault-tolerant matrix operations. We use the general definition of encoded matrices as 
given in [38]. 

DEFINITION 2.3. An encoder vector is a vector whose inner product with a 
column/row vector will produce a column/row check element. □ 

DEFINITION 2.4. An encoder vector is said to be a Valid Encoder Vector (VEV) if it 
produces check elements whose properties will be preserved during matrix multiplication, 
addition, transposition and LU-decomposition. 

It has been proved that linearity is a necessary and sufficient condition for an encoding 
vector to be a VEV. Therefore, in the following discussion we consider only linear 
encoding schemes. 


2.4.1.1. General description of linear codes 

A data sequence over any finite field can be divided into blocks of k symbols 
which are processed independently. A typical block may be represented as a row vector 
of length k 

\[ x = [x_1, x_2, \ldots, x_k] \]

and the corresponding code vector is given as 

\[ y = [y_1, y_2, \ldots, y_n]. \]

Here x and y are related by 

\[ y = xG \]

where G is a k x n matrix called the generator matrix [40, 41]. Thus the row space of G is 
the linear code Y, and a vector is a code vector if and only if it is a linear combination of the 
rows of G. Such a code is called an (n, k) code. Error detection is accomplished with the 
help of the parity check matrix H which satisfies the condition 

\[ G H^T = 0. \]

The number of errors which can be detected and corrected by a code can be 
described in terms of the Hamming weight [12, 41, 40] of the code. A code of Hamming 
weight d + 1 can detect at most d errors and correct at most $\lfloor d/2 \rfloor$ errors [12, 42]. Error 
detectability may also be expressed in terms of the linear independence of columns of the 
matrix $H^T$. A code is t error detectable if and only if any set of $\le t$ columns of 
$H^T$ is linearly independent [41]. In order to derive a correspondence between 
finite-field codes and real-number codes, we make use of the second definition of error 
detectability. 

2.4.2. Systematic codes 

Systematic codes are a special class of linear (n, k) codes. Here, (n-k) check elements 
are appended to k actual data elements. If the actual data word is 

\[ x = [x_1, x_2, \ldots, x_k] \]

the corresponding code word is 

\[ y = [x_1, x_2, \ldots, x_k, c_1, c_2, \ldots, c_{n-k}]. \]

The generator matrix G of the systematic codes is of the form 

\[ G = [I_k \mid P], \qquad (1) \]

where $I_k$ is a k-dimensional unit matrix and P is a k x (n-k) matrix. A matrix H of the 
form $[-P^T \mid I_{n-k}]$ will form a parity check matrix. 
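
As an illustration of the systematic form, the sketch below builds G and H from a given P, encodes a data word, and evaluates the syndrome $H y^T$, which is zero for every code word. The helper names are hypothetical, and the choice of P as a single all-ones column (a plain checksum over the reals) is an assumption made to keep the numbers small.

```python
def identity(n):
    """n x n unit matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def systematic_matrices(P):
    """Build G = [I_k | P] and H = [-P^T | I_(n-k)] from a k x (n-k) matrix P."""
    k, r = len(P), len(P[0])
    I_k, I_r = identity(k), identity(r)
    G = [I_k[i] + P[i] for i in range(k)]
    H = [[-P[i][j] for i in range(k)] + I_r[j] for j in range(r)]
    return G, H


def encode(x, G):
    """Code vector y = xG."""
    return [sum(x[i] * G[i][j] for i in range(len(x))) for j in range(len(G[0]))]


def syndrome(y, H):
    """H y^T: all zeros when y is a code vector."""
    return [sum(h * v for h, v in zip(row, y)) for row in H]


G, H = systematic_matrices([[1], [1], [1]])   # k = 3, n = 4
y = encode([4, 5, 6], G)                      # data followed by its checksum
clean = syndrome(y, H)
y_bad = y[:]
y_bad[1] = 7                                  # corrupt one data element
dirty = syndrome(y_bad, H)
```

Because the data elements appear unchanged in the first k positions of y, the original word is recovered by simply dropping the check elements once the syndrome is found to be zero.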

In most of the high speed processing techniques, systematic encoding is preferred 
because once the received (or computed) result is found to be error free, retrieval of the 
actual information from the code vector is straightforward. Checksum and weighted 
checksum encodings are examples of systematic encoding. However, it has been proved 
that any linear encoding is equivalent to a systematic encoding scheme, in the sense that 
any linear generator matrix can be transformed into another combinatorially equivalent 
generator matrix [41] of the form given in Equation (1). Therefore, in the following dis- 
cussion we will not make any distinction between a linear code and a systematic linear 


code. 

LEMMA 2.1. Vectors which are linearly independent over a finite field are also 
linearly independent over the field of real numbers. 

PROOF: Let us consider a finite field GF(q) where the additions and multiplications 
are done modulo q. Suppose $v_1, v_2, \ldots, v_k$ are linearly independent over the field 
GF(q). Let A be the matrix whose columns/rows are the vectors $v_1, v_2, \ldots, v_k$. By 
definition of linear independence [43], there exists a k x k submatrix D of A such that 

\[ |D| \pmod{q} \neq 0, \]

where |D| is the determinant of the submatrix D. For determining the linear dependence 
or independence of these vectors over the field of real numbers, we take the linear 
combination of the rows of A, where the rows are multiplied by real numbers rather than by 
elements from GF(q). If $r_i$ is the real number multiplicand of vector $v_i$, in the place of |D| 
we will have $\left(\prod_{i=1}^{k} r_i\right) |D|$, which is not equal to zero, since $|D| \pmod{q} \neq 0$. Therefore, 
the vectors $v_1$ through $v_k$ are linearly independent over the field of real numbers. □ 

LEMMA 2.2. If vectors $v_1, v_2, \ldots, v_k$ are linearly dependent over a finite field 
GF(q), they are not necessarily linearly dependent over the field of real numbers. 

PROOF: If $v_1, v_2, \ldots, v_k$ are linearly dependent, it implies that any submatrix D of 
A is such that 

\[ |D| \pmod{q} = 0, \]

which does not imply that $\left(\prod_{i=1}^{k} r_i\right) |D| = 0$; therefore, the vectors need not be linearly 
dependent over the field of real numbers. □ 

THEOREM 2.1. For any t-error detecting code defined over a finite field, there exists 
a corresponding code over the field of real numbers, with the same generator matrix and 
the same parity check matrix, whose error detectability is $\ge t$. 

PROOF: Let $C_f$ be a t-error detecting code defined over a finite field with generator 
matrix $G_f$ and parity check matrix $H_f$. From the previous discussion, we know that every 
set of t, or fewer, columns of $H_f$ will be linearly independent over the finite 
field. Then, by Lemma 2.1, these columns are also linearly independent over the field of 
real numbers, which implies that for a code $C_r$ over the field of real numbers having 
generator matrix $G_r = G_f$ and parity check matrix $H_r = H_f$, the error detectability will be at 
least equal to t. By Lemma 2.2, it may be possible that a larger number of columns of $H_f$ 
are linearly independent, which effectively increases the error detecting capability of the 
code. Thus, the error detectability of $C_r$ is greater than or equal to t. □ 
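
The gap between the two fields that Lemma 2.2 points to can be seen in a small numerical case. The two vectors below are chosen purely for illustration: over GF(7) the second is three times the first, so they are dependent there, yet their determinant over the integers is non-zero.

```python
def det2(a, b):
    """Determinant of the 2 x 2 matrix whose rows are a and b."""
    return a[0] * b[1] - a[1] * b[0]


q = 7
v1, v2 = (1, 3), (3, 2)        # 3 * v1 = (3, 9), which is (3, 2) modulo 7

d = det2(v1, v2)               # integer determinant
dependent_mod_q = (d % q == 0)
dependent_over_reals = (d == 0)
```

The converse direction used in Lemma 2.1 is visible in the same arithmetic: an integer determinant that is non-zero modulo q cannot itself be zero.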


The set of single-error correcting linear real-number codes presented in [44] is one 
special case of the general sets of codes established by Theorem 2.1. 

EXAMPLE 2.3. Consider the finite field GF(7) employing symbols 
{-3, -2, -1, 0, 1, 2, 3}. A matrix with all distinct columns of length two will define the 
parity matrix H of a Hamming code over the finite field GF(7). Let 

\[
H = \begin{bmatrix}
 1 &  1 &  1 & 1 & 1 & 1 & 1 & 0 \\
-3 & -2 & -1 & 1 & 2 & 3 & 0 & 1
\end{bmatrix}.
\]

This will also define a real-number code by regarding H as being over the real numbers. 
The corresponding generator matrix is 

\[
G = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & -1 &  3 \\
0 & 1 & 0 & 0 & 0 & 0 & -1 &  2 \\
0 & 0 & 1 & 0 & 0 & 0 & -1 &  1 \\
0 & 0 & 0 & 1 & 0 & 0 & -1 & -1 \\
0 & 0 & 0 & 0 & 1 & 0 & -1 & -2 \\
0 & 0 & 0 & 0 & 0 & 1 & -1 & -3
\end{bmatrix}.
\]

This real-number code can detect at least two errors or correct one error. □ 
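
The properties claimed for this code can be checked numerically. The sketch below verifies that every row of G is orthogonal to the rows of H, that every pair of columns of H is linearly independent over the reals, and that a single error is located by finding the column of H parallel to the syndrome. The function names and the sample data word are assumptions for illustration; integer arithmetic is used so that comparisons are exact.

```python
H = [[1, 1, 1, 1, 1, 1, 1, 0],
     [-3, -2, -1, 1, 2, 3, 0, 1]]
G = [[1, 0, 0, 0, 0, 0, -1, 3],
     [0, 1, 0, 0, 0, 0, -1, 2],
     [0, 0, 1, 0, 0, 0, -1, 1],
     [0, 0, 0, 1, 0, 0, -1, -1],
     [0, 0, 0, 0, 1, 0, -1, -2],
     [0, 0, 0, 0, 0, 1, -1, -3]]
COLS = list(zip(*H))   # columns of H as 2-tuples


def syndrome(y):
    """H y^T over the integers."""
    return [sum(h * v for h, v in zip(row, y)) for row in H]


def columns_pairwise_independent():
    """True when every 2 x 2 determinant formed by two columns of H is non-zero."""
    return all(a[0] * b[1] - a[1] * b[0] != 0
               for i, a in enumerate(COLS) for b in COLS[i + 1:])


def encode(x):
    """Code vector y = xG."""
    return [sum(x[i] * G[i][j] for i in range(len(G))) for j in range(len(G[0]))]


def locate_single_error(y):
    """Index of the single erroneous position, or None if the syndrome is zero."""
    s = syndrome(y)
    if s == [0, 0]:
        return None
    hits = [j for j, c in enumerate(COLS) if c[0] * s[1] - c[1] * s[0] == 0]
    return hits[0] if len(hits) == 1 else None


y = encode([1, 2, 3, 4, 5, 6])
y_bad = y[:]
y_bad[2] += 5                      # single error in position 2
position = locate_single_error(y_bad)
```

Since a single error of magnitude e in position j produces the syndrome e times column j of H, pairwise independence of the columns guarantees that exactly one column matches.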

EXAMPLE 2.4. Let us consider simple parity encoding over the field of binary 
numbers. It is known that parity codes are single error detecting [40] (that is, the 
Hamming distance is two), with a generator matrix 

\[ G = [I_k \mid P] \]

where $P = [1, 1, \ldots, 1]^T$. It can be observed that the corresponding code (as in 
Theorem 2.1) over the field of real numbers is the simple row checksum code. □ 

The one-to-one correspondence between finite-field codes and real-number codes is 
a powerful result from an implementation point of view: (1) since most of the existing 
codes are proposed for finite fields, adapting those codes for real-number computations 
will be easier than inventing new codes for real numbers; (2) the real-number codes lend 
themselves to implementation in digital signal processors employing standard arithmetic 
units; (3) furthermore, they can be conveniently implemented in software which does not 
efficiently admit the bit-by-bit representation and manipulation required by finite-field 
codes. 

The application of these general sets of codes greatly improves the numerical 
performance of the fault tolerance scheme [32]. Details may be found in [38]. 

2.5. Conclusions 

We discussed the salient features of ABFT techniques. A detailed description of 
systems supporting ABFT was presented with examples. The concept of (g, h) checks 
was elaborated and the distinctions between these checks and the Hamming codes were 
highlighted. Finally, we considered fault-tolerant matrix operations using ABFT on a 
processor array. In the process of developing a general set of codes for fault-tolerant 
matrix operations, we proved a fundamental theorem relating the error detectability of 
finite-field codes and the error detectability of the corresponding real-number codes. 




CHAPTER 3. 

A MODEL FOR ALGORITHM-BASED FAULT TOLERANCE 


3.1. Introduction 

As discussed in the previous chapter, ABFT techniques are being more and more 
widely applied. Due to the critical nature of most of the application areas, it is necessary 
to know the fault tolerance capabilities of the computer system before it is put to the 
application. This requires an analytical procedure, which in turn requires a good model 
to represent the system in general. 

The analysis of ABFT systems is difficult when compared to the analysis of conven- 
tional fault-tolerant systems such as TMR and TTR. In conventional designs of fault- 
tolerant systems, designers assume that complete tests are available for individual proces- 
sors [18, 19]. That is, if the tested unit is faulty and the tester is fault-free, then the test is 
guaranteed to fail. However, in ABFT systems, errors in computed results are detected 
directly and the faults are detected indirectly. Most of the time there does not exist a 
one-to-one correspondence between errors and faults. One fault may produce multiple 
errors. If a processor is computing more than one data element, a fault in that processor 
may or may not produce an error in one or more of those data elements. For instance, a 
processor computing 3 data elements may generate 8 different error patterns (including 
the case where it does not cause an error in any of the three computed results) when it 
becomes faulty. In order to detect a fault in a processor, the checking operations done on 
the processor must be able to detect all the possible error combinations. The error 
detectability of the checks in the system is limited and hence the checks can detect an error 
only if the size of the error pattern does not exceed the error detectability of the checks. 
Therefore, situations may arise in which fault-free processors are checking a faulty 
processor and the fault still goes undetected. This incomplete nature of the checks 
adds to the complexity of the analysis of ABFT systems. 

The first attempt towards modeling ABFT systems was made by Banerjee and Abraham 
[23], who proposed a graph-theoretic model. In this model, the system is represented 
as a tripartite graph having three groups of nodes: nodes of type F corresponding to the 
possible faulty processors, nodes of type E corresponding to the output data elements on 
which the errors may occur, and nodes of type C corresponding to the checks. Even 
though the model is especially suitable for the analysis of faults in systems using ABFT, 
the analysis of conventional redundancy techniques such as duplication, triplication, or 
NMR can easily be done using this model. The limitation of the model is that the com- 
plexity of the analytical algorithms based on this model is exponential in the number of 
data elements in the system. This leads to enormous memory and time requirements for 
the analysis of complex systems with a large number of processors, with each processor 
producing large volumes of data. However, the model forms a theoretical framework for 
representing fault-tolerant systems. 

In order to reduce the complexity of the analysis algorithms, we propose a 
matrix-based model. In this model, we define three matrices, the PD (Processor-Data) matrix, 
the DC (Data-Check) matrix and the PC (Processor-Check) matrix, which describe the 
system as a whole. The PD matrix represents the relationship between the processors and 
the data elements computed by them. The DC matrix contains the information regarding 
which check is checking which data element. The PC matrix is the product of the PD 
and the DC matrices. 

If a check processor becomes faulty, the checking operations performed by that 
processor should be invalidated. To that end we introduce pseudo-data elements associated 
with every check processor. A fault in the check processor will always produce an error in 
the pseudo-data element since an infinite weight is assigned to that data element. Thus, 
check invalidation is translated to a problem of error detection at the output of a faulty 
processor. 

In this chapter we first give a brief description of the graph model. For completeness 
of the thesis, we discuss various fault detection and location constraints based on the 
model. The motivation for developing a new model is given by highlighting some of the 
limitations of the graph model. Then the matrix model is developed and the significance 
of the model matrices is explained. The modeling of ABFT systems using both the 
models is illustrated with examples. Finally, in the conclusion, we provide a critical 
comparison between the models. 

3.2. Graph Representation of a System 

In this model, the system is represented as an undirected graph with four sets of 
nodes and edges between them. The first set of nodes (called processor nodes) represents 
the processors performing useful computations. The results of the useful computations of 
the algorithm form the second set of nodes (called data nodes). The set of checks forms 
the third set of nodes (called check nodes). The checks are performed on a set of checking 
processors, which form the fourth set of nodes (called evaluator nodes). 

Edges between processor and data nodes represent dependencies of the result data 
elements on the processors. There is an edge from a processor node $p_i$ to a data node $d_j$ if 
$d_j \in DATA(p_i)$. Edges between data and check nodes represent the definitions of the 
checks on the data elements. If check $c_k$ operates on data element $d_j$, then there is an 
edge between data node $d_j$ and check node $c_k$. Edges between check and evaluator nodes 
model the check evaluation process. If an evaluator $p_j$ participates in the evaluation of a 
check $c_k$, there is an edge between the evaluator node $p_j$ and check node $c_k$. 

A fifth set of nodes, the "pseudo-data" nodes, is introduced to facilitate a uniform 
network to treat faults in processors performing useful computations and faults in 
processors performing check evaluations. Every check has associated with it a number of 
processors involved in the evaluation of the check. For every check-evaluator pair (check 
$c_k$, processor $p_i$), there is a pseudo-data node. Since there is a one-to-one correspondence 
between an evaluating processor and a pseudo-data node for a given check, a fault in a 
processor evaluating a check means the same as an error in the corresponding 
pseudo-data element. 

The notion of invalidation of checks has been extended as errors in the pseudo-data 
elements. In the ordered set of errors, whenever there is an error in a pseudo-data ele- 
ment, the corresponding checking operation is considered to be invalid. The errors in 
pseudo-data elements and actual data elements are treated identically so that faults in 
processors performing useful computations and faults in check-evaluating processors can 
be considered without any distinction. With these observations, the system graph can be 
simplified by merging the data and pseudo-data nodes and the processor and evaluator 
nodes. The resulting graph has three sets of nodes: processor nodes, data nodes and 
check nodes. 

EXAMPLE 3.1. Consider a hypothetical system having 4 processors $P_1$ through $P_4$. 
Processors $P_1$ and $P_2$ produce useful data elements whereas processors $P_3$ and $P_4$ 
perform check evaluations. The relationships among the processors, data, and checking 
operations are as given in the following. 

$DATA(P_1) = \{d_1, d_2, d_3\}$ 

$DATA(P_2) = \{d_2, d_4\}$ 

$DATA(P_3) = \{d_5\}$ 

$DATA(P_4) = \{d_6, d_7\}$ 

$CHECK(d_1) = \{C_1\}$ 

$CHECK(d_2) = \{C_2\}$ 

$CHECK(d_3) = \{C_1\}$ 

$CHECK(d_4) = \{C_2, C_3\}$ 

$CHECK(d_5) = \{C_1\}$ 

$CHECK(d_6) = \{C_2\}$ 

$CHECK(d_7) = \{C_3\}$ 

It may be noted that data elements $d_5$, $d_6$, and $d_7$ are pseudo-data elements corresponding 
to checks $C_1$, $C_2$, and $C_3$, respectively. Figure 3.1 shows a graphical representation of 
the system. □ 

The model can be easily extended for systems having fault-secure checking units. 
In such a case, a check is invalid if and only if the corresponding pseudo-data element is 
erroneous and at least one of the useful data elements evaluated by the check is erroneous. 
If none of the useful data elements evaluated by the check are erroneous, the check 
is not invalidated and it will detect an error in the pseudo-data element and hence the 
fault in the checking processor can be detected. 

Figure 3.1. Graphical representation of the system in Example 3.1. 

3.2.1. Detection and location of faults using the graph model 

In this section, we describe the fault detection and location constraints derived in 
[23] using the graph model. To that end, we explain some of the terminology used in that 
study. 

The set of checks that may fail for an error pattern $E^i$ is denoted by $FAIL(E^i)$. 
When $E^i$ consists of a single data element $d_j$, the set of checks in $FAIL(E^i)$ is guaranteed 
to fail. When $E^i$ contains more than one element, the condition on the set of checks in 
$FAIL(E^i)$ is that they "may fail" instead of being "guaranteed to fail." This is because it 
is quite possible that a check that is guaranteed to fail for an error in a single data 
element might become invalidated in the presence of other errors. However, if a check is 
not a member of $FAIL(E^i)$, it is guaranteed to pass. The set of checks that are invalidated 
by the presence of the error pattern $E^i$ is denoted by $INVALID(E^i)$. Then a generalized 
error table, GET, which is a $2^n \times q$ array, can be defined [23] such that $GET_{i,k}$ = 0, 1, or X 
(where X denotes an invalid entry) if, with error pattern $E^i$ present, check $c_k$ is known to 
always pass, always fail, or have an unknown result, respectively. 
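
One row of such a table can be sketched for a small hypothetical system with three checks. The construction below is an assumption layered on the (g, h) check definition, not the procedure of [23]: a check is marked X when its pseudo-data element is erroneous or when more than h of the elements it covers are erroneous, 1 when between one and h covered elements are erroneous, and 0 otherwise. The names `get_row`, `CHECKED_BY`, `H_DETECT` and `PSEUDO`, and the value h = 1, are all illustrative.

```python
# Hypothetical system: data elements covered by each check (pseudo-data included).
CHECKED_BY = {
    "C1": {"d1", "d3", "d5"},
    "C2": {"d2", "d4", "d6"},
    "C3": {"d4", "d7"},
}
PSEUDO = {"d5", "d6", "d7"}                 # pseudo-data elements of the checks
H_DETECT = {"C1": 1, "C2": 1, "C3": 1}      # assumed error detectability of each check


def get_row(error_pattern):
    """Return the GET row (check -> '0', '1' or 'X') for a set of erroneous elements."""
    row = {}
    for check, covered in CHECKED_BY.items():
        hit = error_pattern & covered
        if not hit:
            row[check] = "0"                 # no covered element is erroneous: pass
        elif (hit & PSEUDO) or len(hit) > H_DETECT[check]:
            row[check] = "X"                 # evaluator faulty or too many errors
        else:
            row[check] = "1"                 # guaranteed to fail
    return row


single = get_row({"d1"})
double = get_row({"d1", "d3"})
with_pseudo = get_row({"d4", "d6"})
```

Filling the complete table requires one such row for each of the $2^n$ error patterns, which is the source of the exponential cost discussed later in this chapter.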

In the following, we define two terms, masking and exposing of faults, in the context of 
error patterns produced by those faults. These terms are frequently used in upcoming 
discussions. 

DEFINITION 3.1. A fault pattern $F^j$ is said to be masked by a fault pattern $F^k$ if and 
only if there exist error patterns $E^m \in ERROR(F^j)$ and $E^n \in ERROR(F^k)$ such that 
$FAIL(E^m) \subseteq INVALID(E^n)$. □ 

DEFINITION 3.2. A fault in $F^j$ is exposed if it is not masked by $F^j$. Suppose $f_b \in F^j$ 
such that it is exposed in $F^j$. This implies that for all error patterns $E^m \in ERROR(f_b)$ and 
$E^n \in ERROR(F^j)$, $FAIL(E^m) \not\subseteq INVALID(E^n)$. □ 

3.2.1.1. Conditions on fault detection 

An algorithm has t-fault-detectability iff some check in C will definitely fail 
provided the number of faults present in the system on which the algorithm is executed 
does not exceed t. It was implicitly assumed that no check will fail if the system is 
fault-free. With these formulations, conditions are derived for t-fault-detectability [45]. 

THEOREM 3.1. An algorithm A executing on a computing system S has 
t-fault-detectability if and only if, for every non-zero $F^i \in F(t)$, it is implied that for all 
$E^l \in ERROR(F^i)$, $GET_{l,k} = 1$ for some $c_k \in C$. 

PROOF: The proof of this theorem is given in [23]. 

This necessary and sufficient condition for fault detection is difficult to evaluate in 
practice. Instead, the concept of closure of a fault has been introduced [23], which is very 
similar to the closure of faults defined in [18]. Despite this concept, the algorithm for 
fault detection is based on the exhaustive enumeration of all error combinations and 
hence is exponential. However, it forms a basis for a condition for fault detection. 

3.2.1.2. Conditions on fault location 

An algorithm is said to have t-fault-locatability if and only if the application of the 
check set identifies precisely which faults are present, provided the number of faults does 
not exceed t. In order to evaluate the fault locatability of a system, the concept of row 
intersection has been used [23], similar to the row intersection operation (denoted by $\cap$) 
defined in [18]. 

THEOREM 3.2. An algorithm has t-fault-locatability if and only if, for all unequal 
fault patterns $F^i, F^j \in F(t)$, it is implied that for all $E^m \in ERROR(F^i)$ and for all 
$E^n \in ERROR(F^j)$, 

\[ GET_m \cap GET_n = \emptyset. \]

PROOF: The proof of this theorem is given in [23]. 

It has been observed [23] that a system is t-fault locatable if in any fault pattern of 
cardinality k, min(k, 2t+1-k) faults are exposed for k = 1, 2, ..., min(2t, n). Algorithms 
have been developed for determining the fault locatability of systems using this sufficient 
condition, which again need the exhaustive enumeration of all error patterns. Based on 
these results, we derive better sufficiency conditions for t-fault locatability along with 
our second model. 

3.2.2. Limitations of the graph-theoretic model 

Here we summarize the drawbacks of the graph model. As discussed in the preced- 
ing sections, the analysis algorithms based on this model need exhaustive enumeration of 
all error patterns and hence are of exponential complexity. Since one pseudo-data ele- 
ment is introduced for every checking operation, that will effectively increase the number 
of data elements in the system which in turn means a larger exponent of complexity. 

In the next section we propose a matrix-based model which does not have the 
above-mentioned drawbacks. In order to incorporate the invalidation of checks done by 
faulty processors, we introduce one pseudo-data element per checking processor instead 
of one for each checking operation (note that a processor may perform more than one 
checking operation). The analysis algorithms are of linear complexity in the number of 
data elements, and polynomial in the number of processors. 


3.3. An Improved Matrix-Based Model 

In an improved model for multiple processor systems, the relationships between 
processors, data, and checks can be represented by three fundamental matrices, the PD 
(Processor-Data) matrix, the DC (Data-Check) matrix, and the PC (Processor-Check) 
matrix [46]. Unlike the graph-theoretic model described in the previous section, we do 
not make any assumptions regarding the fault secureness of the check evaluating proces- 
sors in this model. Instead, the model is developed with the following general assump- 
tions. Whenever a check evaluating processor becomes faulty, all of the checks done by 
that processor become invalid (Byzantine type faults are assumed here). If a processor is 
performing both useful computation and check evaluation, we identify two kinds of 
faults associated with it: (1) observable faults and (2) unobservable faults. For an observ- 
able fault, at least one of the data elements produced by the faulty processor will be 
erroneous, whereas for an unobservable fault all the useful computation results from the 
processor will be correct. In both the above cases, all of the check evaluations done by 
the faulty processor will be deemed to be invalid. 

3.3.1. The model matrices 

In the new model for multiple processor systems, the relationships between proces- 
sors, data, and checks are represented by three fundamental matrices, the PD (Processor- 
Data) matrix, the DC (Data-Check) matrix, and the PC (Processor-Check) matrix. We 
define the following model matrices in terms of parameters N, the number of processors, 
n, the number of data elements, and q, the number of checking operations in the system. 

DEFINITION 3.3. The PD matrix is an N x n matrix such that 

\[
PD_{ij} = \begin{cases} 1 & \text{if } d_j \in DATA(P_i) \\ 0 & \text{otherwise} \end{cases}
\]
□ 

DEFINITION 3.4. The DC matrix is an n x q matrix such that 

\[
DC_{ij} = \begin{cases} 1 & \text{if } C_j \in CHECK(d_i) \\ 0 & \text{otherwise} \end{cases}
\]
□ 

DEFINITION 3.5. The PC matrix is an N x q matrix which is the product of the PD 
and DC matrices. □ 
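
Definitions 3.3 to 3.5 can be illustrated on a small hypothetical system with two computing processors, four actual data elements and three checks. The DATA and CHECK sets below are assumptions chosen for illustration, and no pseudo-data elements are included at this stage.

```python
# Hypothetical system description (actual data elements only).
DATA = {"P1": {"d1", "d2", "d3"}, "P2": {"d2", "d4"}}
CHECK = {"d1": {"C1"}, "d2": {"C2"}, "d3": {"C1"}, "d4": {"C2", "C3"}}

processors = ["P1", "P2"]              # N = 2
data = ["d1", "d2", "d3", "d4"]        # n = 4
checks = ["C1", "C2", "C3"]            # q = 3

# PD[i][j] = 1 if d_j is in DATA(P_i), 0 otherwise   (Definition 3.3)
PD = [[1 if d in DATA[p] else 0 for d in data] for p in processors]

# DC[i][j] = 1 if C_j is in CHECK(d_i), 0 otherwise  (Definition 3.4)
DC = [[1 if c in CHECK[d] else 0 for c in checks] for d in data]

# PC = PD x DC                                        (Definition 3.5)
PC = [[sum(PD[i][k] * DC[k][j] for k in range(len(data)))
       for j in range(len(checks))] for i in range(len(processors))]
```

Each entry of `PC` counts how many data elements of a processor are covered by a check; here the first row shows that two elements of P1 feed C1 and one feeds C2.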

It may be noted that so far in this model we have considered only actual data ele- 
ments. Until now, there is no relationship established between a checking operation and 
the processor which performs that operation. (It may be noted that in the graph model 
this relationship was accounted for through pseudo-data elements.) However, we will 
incorporate this relationship between processors and checks performed by them in the 
next section by defining a new set of pseudo-data nodes. 

Until now, there exists a correspondence between the system graph and the model 
matrices. If we split the tripartite graph into two bipartite graphs, a processor-data graph 
and a data-check graph, the PD and DC matrices are the adjacency matrices of those 
bipartite graphs, respectively. Now construct another bipartite graph having the set of 
processor nodes and the set of check nodes as its parts such that there is an edge from 
node $P_i$ to node $c_j$ if there is a path of length two between these two nodes in the original 
system graph. The PC matrix is the adjacency matrix of this new graph (can be a multi- 
graph). However, the correspondence between the graph model and the matrix model 
will be lost once we introduce the concept of pseudo-data elements. 

3.3.2. Physical significance of the model matrices 

The physical significances of the PD and the DC matrices are clear from their 
definitions. In the PC matrix, $PC_{ij}$ represents the number of data elements of $P_i$ checked 
by check $C_j$. It can be seen that entries in the PD and DC matrices are either a 0 or a 1, 
whereas the PC matrix can have elements as large as n. 

The importance of these matrices in the analysis of faults in the system will be 
revealed in the following discussion. Without loss of generality, we can use the same 
matrices for representing faults and errors in the system. The only difference is that in 
the PD and PC matrices, the row corresponding to $P_i$ stands for a fault in processor $P_i$. 
Those elements of row $P_i$ of the PD matrix will be 1 if the corresponding data elements 
are erroneous due to a fault in processor $P_i$: 

\[
PD_{ij} = \begin{cases} 1 & \text{if } d_j \text{ is erroneous when } P_i \text{ is faulty} \\ 0 & \text{otherwise} \end{cases}
\]

With this interpretation of matrix entries, it is easy to observe that each row in the 
fundamental PD matrix, defined earlier, represents a faulty processor whose output data 
elements are all wrong. The PD matrix will be different for different error combinations at 
the output. For coherence of terminology, the PD matrices resulting from various output 
error combinations are called the syndromes of the original PD matrix, as in Definition 3.3. 
Correspondingly, we will also have different syndromes of the PC matrix. The DC 
matrix is independent of the output error combination and is determined only by the 
system designer; hence it has only one syndrome, which is the DC matrix itself. 

3.3.3. Check invalidation 

In order to accommodate the invalidation of checks performed by the faulty proces- 
sors, we introduce pseudo-data elements into the system model. These pseudo-data ele- 
ments are conceptually similar to the pseudo-data nodes associated with the graph 
model, but are modeled and used differently. If a processor is performing one or more 
check evaluations, a pseudo-data element of infinite weight is attached to that processor. 
Later, every check done by that processor is assumed to be checking the correctness of its 
pseudo-data element also. If the pseudo-data element is erroneous, all of the checks done 
by that processor become invalid, since such a data element has infinite weight. Thus, 
check invalidation is translated into a problem of error detection at the output of a faulty 
processor. 

Accordingly, the model matrices are extended as follows. Suppose m is the number 
of processors performing check evaluations. 

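DEFINITION 3.6. The PD matrix is an N×(n+m) matrix such that 

\[
PD_{ij} =
\begin{cases}
1 & \text{if } d_j \in DATA(P_i) \\
\infty & \text{if } d_j \text{ is the pseudo-data element of } P_i \\
0 & \text{otherwise}
\end{cases}
\]
□ 

DEFINITION 3.7. The DC matrix is an (n+m)×q matrix such that 

\[
DC_{ij} =
\begin{cases}
1 & \text{if } C_j \in CHECK(d_i) \\
1 & \text{if } C_j \text{ is resident in } P_k \text{ and } d_i \text{ is the pseudo-data element of } P_k \\
0 & \text{otherwise}
\end{cases}
\]
□ 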

The PC matrix is obtained by finding the product of the PD and the DC matrices. 

EXAMPLE 3.2. Let us consider the system shown in Figure 3.1. The check c_1 is 
performed by processor P_3 and the checks c_2 and c_3 are performed by P_4. The 
corresponding PD, DC, and PC matrices are 

\[
PD =
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & \infty & 0 \\
0 & 0 & 0 & 0 & 0 & \infty
\end{bmatrix},
\qquad
DC =
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
1 & 0 & 0 \\
0 & 1 & 1 \\
1 & 0 & 0 \\
0 & 1 & 1
\end{bmatrix}
\]

\[
PC = PD \times DC =
\begin{bmatrix}
2 & 1 & 0 \\
0 & 2 & 1 \\
\infty & 0 & 0 \\
0 & \infty & \infty
\end{bmatrix}
\]
□ 
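The product in Example 3.2 can be reproduced with a short Python sketch. This is an illustration only: the helper name `mat_mul` is hypothetical, and the convention that 0 times infinity contributes nothing to an entry is an assumption made here so that the result matches the PC matrix given in the example. 

```python
import math

INF = math.inf  # weight of a pseudo-data element


def mat_mul(a, b):
    """Matrix product in which a term 0 * infinity is skipped (assumed convention)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0
            for k in range(inner):
                # Only nonzero pairs contribute; this avoids 0 * inf = nan.
                if a[i][k] != 0 and b[k][j] != 0:
                    total += a[i][k] * b[k][j]
            out[i][j] = total
    return out


# Matrices of Example 3.2: columns 5 and 6 of PD are the pseudo-data elements
# of P_3 and P_4, the two processors that evaluate checks.
PD = [
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, INF, 0],
    [0, 0, 0, 0, 0, INF],
]
DC = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
]

PC = mat_mul(PD, DC)
for row in PC:
    print(row)
```

The infinite entries in the rows of P_3 and P_4 mark exactly the checks that become invalid when the processor hosting them is faulty. 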


3.4. Conclusions 

In this chapter we have presented a new matrix-based model for the analysis and 
design of fault-tolerant multiprocessor systems. The great complexity of the analysis 
algorithms based on the existing graph-theoretic model was the prime motivating factor 
in proposing the new model. How the reduction in complexity is achieved will be dis- 
cussed in the next chapter. It should be noted that the matrix model is not the matrix 
equivalent of the graph model proposed in [23]. There are subtle differences in the for- 
mulations of the models. In the following, we summarize a comparison between the 
graph model and the matrix model. 


3.4.1. Comparison between the graph model and the matrix model 

As described earlier in the chapter, the graph model consists of a tripartite graph. 
Processors, data elements, and checks are the three parts in the graph. In the matrix 
model we also identify these three entities, whose relationships are represented as the 
PD, DC, and the PC matrices. 

The main difference between the two models lies in the way check invalidation is 
handled when the processor performing that check is faulty. In the graph-theoretic model 
a pseudo-data node was defined along with each check in the system. This approach was 
born out of the definition of fault-secure checks [12], in which checks are capable of 
indicating their own correctness as well. In the graph model these pseudo-data nodes are 
distinguished from the actual data nodes by their respective positions in the set of data 
nodes. The disadvantages of this approach are: 

(1) The ordering of the data elements has to be preserved during the analysis; in other 
words, every data element (including the actual data and the pseudo-data) has its 
own identity. It is this constraint which causes an exponential complexity of the 
corresponding analysis algorithm as we shall see in the next chapter. 

(2) Every time a new check is added in the system, a new pseudo-data element is also 
added. Therefore, the complexity of the analysis algorithm is exponential not only 
in the number of data elements but also in the number of checks, in an indirect way. 

In the matrix-based model, whenever a processor is performing one or more checks, 
one pseudo-data element of infinite weight is added to the output data set of that proces- 
sor (the actual data elements assume unit weight in the model). The pseudo-data ele- 
ments and the actual data elements are distinguished from each other by their weights 
rather than by their positions. This permits a special grouping of actual data elements in 
a system while considering all the possible error patterns; data elements within a group 
do not have individual identity. Based on this grouping, we illustrate an 
error collapsing technique, which eventually results in much less complex analysis 
algorithms. 

In the graph model, the system information is distributed and processed in two 
domains: the processor-data domain and the data-check domain. In the matrix model we 
introduced one more domain of operation, the processor-check domain. In fact, the PC 
matrix, which represents the processor-check domain, is our main work space. The PC 
domain is derived from the PD and DC domains, during which some information may be 
lost. However, most of the lost information happens to be unnecessary for the analysis as 
we shall see in the next chapter. Whenever necessary, we go back to the PD and DC 
domains to supplement the information to the PC domain. Again, selecting the PC 
domain as the main domain of analysis greatly simplifies the analysis procedure. 

CHAPTER 4. 


ANALYTICAL APPLICATIONS OF THE MATRIX-BASED MODEL 


4.1. Introduction 

In this chapter we describe the applications of the matrix-based model for the 
analysis of ABFT systems. Following the definitions given in the previous chapter, we 
develop algorithms for determining the fault detectability and locatability of ABFT sys- 
tems. These algorithms are much less complex than the algorithms based on the graph- 
theoretic model. The new algorithm for the fault detectability analysis is of linear com- 
plexity in the number of data elements in the system, whereas the complexity of the loca- 
tability algorithm is quadratic in the number of data elements. The reduction in complex- 
ity is achieved by using: (1) a special error collapsing technique which allows the 
analysis of a system without having to enumerate all the possible error combinations; (2) 
simpler sufficiency conditions which are developed in this thesis. Even though these 
algorithms are developed particularly for the analysis of ABFT systems, they are 
applicable to the analysis of conventional fault-tolerant architectures such as N-modular 
structures. 

We illustrate the applications of the algorithms by analyzing various fault-tolerant 
signal processing architectures. Finally, we provide an alternative method for the 
invalidation of checks performed by faulty processors. 

4.2. Fault Analysis of a System 

As far as the fault detectability and locatability of a system are concerned, we have 
to consider only the observable faults, since the unobservable faults are not going to 
cause any error in the useful data elements. However, it is necessary to consider the 
detectability and locatability of observable faults in the presence of unobservable faults. 
For example, consider a fault pattern consisting of faults on three processors P_1, P_2, and 
P_3. Let the fault present in P_1 be an observable fault, and the faults in P_2 and P_3 be 
unobservable faults. Now, for the system to be 3-fault detectable, it is necessary that the 
observable fault in P_1 be detectable in the presence of unobservable faults in P_2 and P_3. 

Therefore, if a fault is not observable, instead of assuming that that particular fault 
is not present, in our analysis we consider it as a detectable fault. In order to define the 
fault detectability and locatability of a system, we introduce the concept of observability 
of a fault pattern. 

DEFINITION 4.1. A fault pattern is observable if and only if at least one of the indivi- 
dual faults present in it is observable. □ 

DEFINITION 4.2. A fault pattern is said to be completely detectable if it is either unob- 
servable or it is detectable for all the possible output error combinations. □ 

In the following, we will use the terms faults and fault patterns interchangeably to 
mean either an individual fault or a set of faults depending on the situation. 

DEFINITION 4.3. A fault-tolerant system has t-fault detectability if and only if some 
check C_i will definitely fail, provided the cardinality of any observable fault pattern does 
not exceed t. □ 

4.3. Analysis for Fault Detectability 

For the analysis we define matrices ^rPD and ^rPC, which are derived from the PD and 
the PC matrices, respectively. 

DEFINITION 4.4. The ^rPD matrix is defined as the matrix whose rows are formed by 
adding r different rows of matrix PD, for all possible different combinations of r rows, 
and then setting all nonzero elements, except the infinity elements, to 1. □ 

Note that a nonzero element greater than 1 results from the addition of rows of the PD 
matrix if the processors corresponding to those rows have common data elements. These 
nonzero elements are set to 1 in order to avoid duplication of the same data element. 

DEFINITION 4.5. Matrix ^rPC is the product of ^rPD and the DC matrix. □ 

^rPC is a C(N, r) × q matrix, where C(N, r) denotes the number of ways of choosing r of 
the N rows of PD. In the fault analysis, each row of ^rPD and ^rPC will represent the 
situation in which r faults are present simultaneously. As a special case, it may be 
observed that ^1PC = PC. 
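Definitions 4.4 and 4.5 can be sketched in Python as follows. The function names `r_pd`, `r_pc`, and `product` are hypothetical, the matrices are those of Example 3.2, and the product again assumes that a term 0 times infinity is skipped. 

```python
import math
from itertools import combinations

INF = math.inf


def product(a, b):
    """Matrix product; zero terms are skipped so that 0 * inf never appears."""
    return [[sum(x * y for x, y in zip(row, col) if x and y) for col in zip(*b)]
            for row in a]


def r_pd(pd, r):
    """Definition 4.4: add every combination of r rows, then clamp finite nonzeros to 1."""
    rows = []
    for combo in combinations(range(len(pd)), r):
        summed = [sum(pd[i][j] for i in combo) for j in range(len(pd[0]))]
        rows.append([x if x in (0, INF) else 1 for x in summed])
    return rows


def r_pc(pd, dc, r):
    """Definition 4.5: rPC is the product of rPD and DC."""
    return product(r_pd(pd, r), dc)


# PD and DC of Example 3.2 (the last two columns of PD are pseudo-data elements).
PD = [
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, INF, 0],
    [0, 0, 0, 0, 0, INF],
]
DC = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 1, 1]]

two_pd = r_pd(PD, 2)       # C(4, 2) = 6 rows, one per pair of faulty processors
two_pc = r_pc(PD, DC, 2)
print(two_pd[0], two_pc[0])  # the pair (P_1, P_2)
```

For the pair (P_1, P_2) the shared element d_2 is counted once in ^2PD, which is the duplication that the clamping step is meant to avoid. 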

DEFINITION 4.6. The row R of ^rPC is said to be completely detectable if and only if 
the fault represented by R is completely detectable. □ 

If R represents an observable fault, there should be at least one element in the row R 
which is nonzero and less than or equal to the error detectability (h) of the check used, 
for all possible errors. If we enumerate all the possible error combinations, the algorithm 
to check the complete detectability of a row will be as complex as the previous ones. 
Instead, we use an error collapsing technique so that the algorithm converges much faster 
and needs less storage. 

4.3.1. Algorithm to check whether R is completely detectable 

The algorithm is outlined as ALGORITHM 1. 

ALGORITHM 1. 

(1) If the elements of row R are either zero or infinity, R is completely detectable; stop. 
    Otherwise, go to step 2. 

(2) If there is no element in row R which is less than or equal to h and greater than 0, R 
    is not completely detectable; stop. Otherwise, go to step 3. 

(3) Find all j such that 0 < R_j ≤ h. 

(4) For every data element d_i with DC_ij = 1, set ^rPD_Ri = 0, where j ∈ {1, 2, ..., q} is 
    an index found in step 3. Do the same for all j obtained from step 3. 

(5) If the elements of the syndrome of R are either zero or infinity, then R is completely 
    detectable; stop. Otherwise, go to step 6. 

(6) Find the new ^rPC matrix by multiplying the new ^rPD matrix obtained from step 4 
    with the DC matrix and go to step 1. 

In the following discussion we describe how the algorithm works. In the first 
step, if the entries of R are all zeros or infinity, then it is an unobservable fault and hence 
is a completely detectable fault. On the other hand, if some of the entries are zeros and the 
rest are greater than h, it means that the errors caused by the fault are not detectable, and 
hence the fault is not detectable. As mentioned before, in the case of analysis of faults in 
systems, the fundamental matrices PD and PC represent the situation in which all the 
output elements of a faulty processor are erroneous [46]. In the algorithm we start with a 
row R which is a combination of some rows of the PC matrix. Therefore, R represents a 
fault such that the output data elements of the processors associated with R are all 
erroneous. If at least one element of R is less than or equal to h and greater than zero (we 
call such an entry a "valid entry"), we can conclude that the fault is detectable by some 
checks provided all the data elements from the faulty processors are erroneous. This does 
not imply that the fault will be detected for all possible error combinations. 

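For example, consider a system which is graphically represented in Figure 4.1. 
Here, 

    DATA(P_1) = {d_1, d_2} 
    DATA(P_2) = {d_3} 
    CHECK(d_1) = CHECK(d_3) = {C_1} 
    CHECK(d_2) = {C_2} 

Then we have the fundamental matrices 

\[
PD =
\begin{bmatrix}
1 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix},
\qquad
DC =
\begin{bmatrix}
1 & 0 \\
0 & 1 \\
1 & 0
\end{bmatrix},
\qquad
PC = PD \times DC =
\begin{bmatrix}
1 & 1 \\
1 & 0
\end{bmatrix}.
\]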


Obviously, the system is single-fault detectable if h = 1. In order to check whether the 
system is 2-fault detectable we compute ^2PC, which is equal to [2, 1]. Since there is a 1 in 
this row, the fault is detectable when all the data elements (d_1, d_2, and d_3) are erroneous. 





Figure 4.1. Graphical representation of an example system. 

Now consider the situation in which the faulty processor P_1 produces an error in d_1 
alone. A fault in P_2 will definitely cause an error in d_3. Thus, errors in d_1 and d_3 will 
invalidate the check C_1; at the same time, check C_2 will pass since the data element d_2 is 
not erroneous. As a result, if P_1 and P_2 are faulty, the fault may not be detected, and 
hence the system is not 2-fault detectable. 
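The undetected combination described above can be confirmed by brute-force enumeration, which is the expensive approach that error collapsing is meant to avoid. The following Python sketch is illustrative only; the names are hypothetical, and it assumes that a check flags a fault exactly when it covers between 1 and h erroneous data elements. 

```python
from itertools import product

# System of Figure 4.1: which data elements each check covers, and which
# data elements each processor produces.
CHECK_COVERS = {"C1": {"d1", "d3"}, "C2": {"d2"}}
DATA = {"P1": ["d1", "d2"], "P2": ["d3"]}
H = 1  # error detectability of each check


def detected(errors):
    """True if some check sees between 1 and H erroneous elements (assumed model)."""
    return any(1 <= len(covers & errors) <= H for covers in CHECK_COVERS.values())


def undetected_patterns(faulty):
    """List every nonempty error combination of the faulty processors that no check flags."""
    outputs = [d for p in faulty for d in DATA[p]]
    missed = []
    for mask in product([0, 1], repeat=len(outputs)):
        errors = {d for d, bit in zip(outputs, mask) if bit}
        if errors and not detected(errors):
            missed.append(sorted(errors))
    return missed


print(undetected_patterns(["P1"]))        # single fault in P_1
print(undetected_patterns(["P2"]))        # single fault in P_2
print(undetected_patterns(["P1", "P2"]))  # double fault
```

Only the double fault leaves an undetected combination, namely errors in d_1 and d_3 together, in agreement with the discussion above. 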

This discrepancy is taken care of in Step 4 of Algorithm 1. The objective is to 
check whether the fault is detectable for all possible output error combinations. For that, 
we use a technique called error collapsing. All the elements of ^rPD which contributed to 
the valid entries of row R are set to zero. By doing this, we are effectively removing 
those errors which are detectable at the output. We may remove all those errors 
simultaneously, because if at least one of them were not removed, it would be detected at 
the output and hence the fault would be detectable. 

The new ^rPC is calculated by multiplying the ^rPD matrix obtained after error 
collapsing with the DC matrix. This new matrix will differ from the old ^rPC in two 
ways. First, the new matrix will have zeros in the positions where the old ^rPC 
had valid entries. Second, some of the invalid entries in the old matrix might have become 
valid entries in the new matrix, because removal of errors may make some of the 
invalid checks valid. 

This iteration is done as given in Algorithm 1 to check the complete detectability of 
row R. It may be noted that the same algorithm can be used for determining the fault 
detectability of systems having fault-secure check evaluation processors. In such a case, 
all the infinity values are set to zero and the analysis is done in the same way. 

EXAMPLE 4.1. In this example, we present a simple instance of how error collapsing 
can help in reducing the complexity of analysis. In Figure 4.2, processor P produces 
three data elements d_1, d_2, and d_3. Data element d_i (i = 1, 2, 3) is checked by check c_i. 
Also, note that the check c_i checks data d_i only. We assume that the error detectability h 
of the checks is equal to 1. 

In order to detect or locate a fault in processor P, we start the analysis by determining 
the detectability and locatability of the worst possible error; here we start with the 
case where all three data elements are erroneous. Since check c_i is checking d_i only, an 
error in d_i will always be detected by c_i irrespective of the status of the rest of the data in 
the system. Thus, we need to check the detectability and locatability of only one error 
pattern (in which all the data elements are erroneous) instead of checking all the possible 
(eight in this case) error combinations. □ 

In the following example, we illustrate the application of the error collapsing tech- 
nique, for the analysis of a hypothetical fault-tolerant multiprocessor system. 



Figure 4.2. Example for error collapsing. 




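EXAMPLE 4.2. Consider a 4-processor fault-tolerant system (for simplicity of 
illustration, we assume that the checks will yield valid results even when the processors 
which perform those checking operations are faulty) whose PD and DC matrices are 

\[
PD =
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix},
\qquad
DC =
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
1 & 0 & 0 \\
0 & 1 & 1 \\
1 & 0 & 0 \\
0 & 1 & 1
\end{bmatrix}
\]

The corresponding PC matrix, with its rows labeled R_1 to R_4, is 

\[
PC =
\begin{bmatrix}
2 & 1 & 0 \\
0 & 2 & 1 \\
1 & 0 & 0 \\
0 & 1 & 1
\end{bmatrix}
\begin{matrix}
R_1 \\ R_2 \\ R_3 \\ R_4
\end{matrix}
\]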

Assuming that the error detectability of the checks is h = 1, we consider the complete 
detectability of rows R_1 and R_2. Since the second element of R_1 is a valid entry, we 
collapse the corresponding error PD_{1,2} of the PD matrix. The resulting row syndrome of R_1 
is [2 0 0], which has no valid entries at all and hence is undetectable. Therefore, R_1 is 
not completely detectable. In the case of R_2, if we collapse the error PD_{2,4} corresponding 
to the valid entry in R_2, the resulting row syndrome will be [0 1 0], which still has one 
valid entry. If we further collapse the error corresponding to that syndrome also, the 
resultant syndrome will be [0 0 0]. Then by Algorithm 1, R_2 is completely detectable. □ 

DEFINITION 4.7. The matrix ^rPC is said to be completely detectable if and only if 
all rows of ^rPC are completely detectable. □ 

THEOREM 4.1. A fault-tolerant system is t-fault detectable if and only if the 
matrices ^iPC, for i = 1, 2, ..., t, are completely detectable. 

PROOF: 

Proof of the necessary condition (by contradiction): If possible, let the system be 
t-fault detectable, and let ^iPC not be completely detectable for some i ≤ t. This 
implies that there exists a fault pattern of cardinality ≤ t which is not completely 
detectable. Therefore, the system is not t-fault detectable, which is a contradiction. 

Proof of the sufficiency condition: Complete detectability of ^iPC implies that 
every fault pattern represented by the rows of ^iPC is completely detectable. Therefore, 
the hypothesis of the theorem implies that every fault pattern of cardinality ≤ t is 
detectable and hence the system is t-fault detectable. □ 
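Algorithm 1 and the test of Theorem 4.1 can be sketched in Python as follows. This is a minimal sketch that follows the steps as stated in the text, not a reference implementation: the function names are hypothetical, a term 0 times infinity is assumed to contribute nothing to a product, and the examples use only matrices without pseudo-data elements. 

```python
import math
from itertools import combinations

INF = math.inf


def times(vec, dc):
    """Row vector times DC; zero terms are skipped so 0 * inf never appears."""
    return [sum(x * y for x, y in zip(vec, col) if x and y) for col in zip(*dc)]


def completely_detectable(pd_row, dc, h=1):
    """Algorithm 1 applied to one row of rPD, using error collapsing."""
    pd_row = list(pd_row)
    while True:
        r = times(pd_row, dc)                               # row R (or its syndrome)
        if all(x == 0 or x == INF for x in r):              # steps 1 and 5
            return True
        valid = [j for j, x in enumerate(r) if 0 < x <= h]  # steps 2 and 3
        if not valid:
            return False
        for j in valid:                                     # step 4: collapse errors
            for i in range(len(pd_row)):
                if dc[i][j] == 1:
                    pd_row[i] = 0
        # step 6: the loop recomputes the row from the collapsed pd_row


def t_fault_detectable(pd, dc, t, h=1):
    """Theorem 4.1: every row of iPD, for i = 1..t, must be completely detectable."""
    for r in range(1, t + 1):
        for combo in combinations(range(len(pd)), r):
            summed = [sum(pd[i][j] for i in combo) for j in range(len(pd[0]))]
            row = [x if x in (0, INF) else 1 for x in summed]
            if not completely_detectable(row, dc, h):
                return False
    return True


# Example 4.2: rows R_1 and R_2.
PD_EX = [[1, 1, 1, 0, 0, 0], [0, 1, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]]
DC_EX = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 1, 1]]
print(completely_detectable(PD_EX[0], DC_EX))  # False: R_1 collapses to [2 0 0]
print(completely_detectable(PD_EX[1], DC_EX))  # True: R_2 collapses to [0 0 0]

# System of Figure 4.1.
PD_FIG = [[1, 1, 0], [0, 0, 1]]
DC_FIG = [[1, 0], [0, 1], [1, 0]]
print(t_fault_detectable(PD_FIG, DC_FIG, 1))   # True
print(t_fault_detectable(PD_FIG, DC_FIG, 2))   # False
```

Each pass of the loop either stops or sets at least one nonzero entry of the ^rPD row to zero, so the number of passes is bounded by the number of data elements produced by the faulty processors. 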

4.4. Analysis for Fault Locatability 

Analyzing the system for its fault locatability is a much harder problem than 
finding its fault detectability. This is because, in the case of 
locatability, we have to check not only whether a fault is detected, but also 
whether that fault is distinguishable from other faults. 

DEFINITION 4.8. A system is said to have t-fault locatability if and only if the 
application of the check set identifies precisely which faults are present, provided the 
cardinality of any observable fault pattern does not exceed t. □ 

LEMMA 4.1. A necessary condition for t-fault locatability of a system, for t ≥ 1, is 
that Σ_i PD_ij ≤ 1 for all j. 

PROOF: We may prove the lemma by contradiction. Σ_i PD_ij ≤ 1 implies that there 
can be at most one 1 in every column of the PD matrix, which means DATA(P_i) ∩ 
DATA(P_j) = ∅ for all i ≠ j. If possible, let there be more than one 1 in a column, which 
implies that the data sets produced by certain processors are not disjoint. In that case, if 
an error is observed in the common data element or elements, it will not be possible to 
conclude which faulty processor produced that error. Therefore, the system is not 
single-fault locatable. Therefore, the assumption that there is more than one 1 in a column 
is wrong, and hence the proof. □ 

In most of the existing multiprocessor systems, processors have nondisjoint output 
data sets, so the assumption DATA(P_i) ∩ DATA(P_j) = ∅ is not valid. In those systems, 
locating a faulty processor will not be possible according to Lemma 4.1. However, 
processors whose data sets have nonempty intersections with other processors can be 
collapsed into processor classes [23] so that the processor classes will have disjoint data sets. 

DEFINITION 4.9. A processor class π_i represents a maximal set of processors such 
that for each processor p_j ∈ π_i, there exists another processor p_k ∈ π_i such that 
DATA(p_j) ∩ DATA(p_k) ≠ ∅. □ 

Any processor not belonging to any such processor class constitutes a class by itself. 
One may be able to locate a faulty processor class (a processor class is said to be faulty if 
at least one of the processors in the class is faulty) during the fault diagnosis of the 
system. The PD matrices for the processor classes are found by adding together the rows of 
the PD matrix corresponding to the processors in the processor class and by setting all 
nonzero elements to 1. 

In the example given in Section 2.2 for the model matrices of a system, DATA(P_1) ∩ 
DATA(P_2) = {d_2}. Here, P_1 and P_2 will form a processor class. Processors P_3 and P_4 
form two different processor classes. Now the corresponding PD and PC matrices are 

\[
PD =
\begin{bmatrix}
1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix},
\qquad
PC =
\begin{bmatrix}
2 & 2 & 1 \\
1 & 0 & 0 \\
0 & 1 & 1
\end{bmatrix}
\]
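The collapsing of processors into classes can be sketched in Python as follows. The sketch interprets a processor class as a connected group of processors linked by shared data elements, which is an assumption about how Definition 4.9 is applied; the function names are hypothetical and the input is a PD matrix without pseudo-data elements. 

```python
def processor_classes(pd):
    """Group processors whose data sets overlap, directly or through a chain of overlaps."""
    n = len(pd)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Any two processors with a nonzero entry in the same column share that data element.
    for j in range(len(pd[0])):
        owners = [i for i in range(n) if pd[i][j]]
        for other in owners[1:]:
            parent[find(other)] = find(owners[0])

    classes = {}
    for i in range(n):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())


def class_pd(pd, classes):
    """Add the rows of each class and set every nonzero element to 1."""
    return [[1 if any(pd[i][j] for i in cls) else 0 for j in range(len(pd[0]))]
            for cls in classes]


PD = [[1, 1, 1, 0, 0, 0], [0, 1, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]]

classes = processor_classes(PD)   # processors are numbered from 0 here
cpd = class_pd(PD, classes)
# Lemma 4.1 holds for the class matrix: at most one 1 per column.
disjoint = all(sum(col) <= 1 for col in zip(*cpd))
print(classes, cpd, disjoint)
```

With processors numbered from 0, the classes found are {P_1, P_2}, {P_3}, and {P_4}, and the class PD matrix agrees with the one given above. 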


For convenience, in the forthcoming discussion, we use the term locatability to mean 
class locatability. 

DEFINITION 4.10. Rows R_1 and R_2 of matrix ^kPC are said to have 1-0 disagreement 
if there is at least one valid element in row R_1 such that R_2 has a zero in the 
corresponding position. □ 

DEFINITION 4.11. Rows R_1 and R_2 of matrix ^kPC are said to have 0-1 disagreement 
if there is at least one 0 in either row R_1 or R_2 such that the other row has a valid element 
in the corresponding position. □ 


EXAMPLE 4.3. Consider the PC matrix given below. 

\[
PC =
\begin{bmatrix}
2 & 2 & 1 \\
1 & 0 & 0 \\
0 & 1 & 1
\end{bmatrix}
\begin{matrix}
R_1 \\ R_2 \\ R_3
\end{matrix}
\]

Here, R_1 and R_2 have 1-0 disagreement whereas R_2 and R_1 have only 0-1 disagreement. 
It can be seen that R_1 and R_3 have no disagreement at all. □ 
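Definitions 4.10 and 4.11 can be checked mechanically. The following Python sketch uses hypothetical function names, assumes h = 1, and treats the 1-0 disagreement as an ordered relation (a valid entry in the first row facing a zero in the second). 

```python
def is_valid(x, h=1):
    """A valid entry is nonzero and at most the error detectability h."""
    return 0 < x <= h


def has_1_0(r1, r2, h=1):
    """Definition 4.10: some valid entry of r1 faces a zero of r2."""
    return any(is_valid(a, h) and b == 0 for a, b in zip(r1, r2))


def has_0_1(r1, r2, h=1):
    """Definition 4.11: a zero in either row faces a valid entry of the other row."""
    return has_1_0(r1, r2, h) or has_1_0(r2, r1, h)


# Rows of the PC matrix of Example 4.3.
R1, R2, R3 = [2, 2, 1], [1, 0, 0], [0, 1, 1]

print(has_1_0(R1, R2))  # True
print(has_1_0(R2, R1))  # False
print(has_0_1(R2, R1))  # True
print(has_0_1(R1, R3))  # False
```

Written this way, a 1-0 disagreement in either order always gives a 0-1 disagreement, but not the other way around. 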

DEFINITION 4.12. If all pairs of rows of ^kPC have 0-1 disagreement, then ^kPC is 
said to have a 0-1 disagreement. □ 

DEFINITION 4.13. PC has 1-0 disagreement with ^kPC if and only if every row R of 
PC has 1-0 disagreement with all rows of ^kPC which do not contain R. □ 

It may be noted that a 1-0 disagreement implies a 0-1 disagreement, whereas a 
0-1 disagreement does not imply a 1-0 disagreement. That is, 1-0 disagreement is a 
stronger condition than 0-1 disagreement. 

4.4.1. Physical significance of disagreement 

When two rows of ^rPC have 0-1 disagreement, the faults 
corresponding to those rows are distinguishable provided the outputs from the processors 
involved in those faults are all erroneous. A 1-0 disagreement between PC and ^rPC 
implies that every individual fault is exposed (or not masked), as defined in Chapter 3, in 
all fault patterns of cardinality r + 1. This is because every row R in PC has a 
1-0 disagreement with all rows of ^rPC which do not contain R. In both of the above 
cases we need all data outputs from a faulty processor to be erroneous, which may not be 
the case all the time. Therefore, we define a stronger relation between rows, namely, 
complete disagreement. 

DEFINITION 4.14. A disagreement (0-1, or 1-0) between two rows is called a 
complete disagreement, if the disagreement exists for all possible error combinations 
caused by the faults associated with those rows. □ 

The disagreement defined in [31] was similar to the 0-1 disagreement defined in this 
thesis. However, it must be noted that the disagreement used in [31] was defined in the 
set of error patterns rather than in the set of fault patterns. 

In order to check for the complete disagreement between two rows we use an 
algorithm similar to the one used for finding the complete detectability. The procedure is 
outlined in Algorithm 2. Whenever there is a disagreement between rows, the valid entry 
or entries which caused the disagreement are set to zero by error collapsing as described 
in Algorithm 1. The algorithm always converges because removal of an error will never 
convert a 0 to a 1 or a higher value, whereas it may or may not decrease the values of 
nonzero entries. (Henceforth, we use the term disagreement to mean 
complete disagreement.) Now we prove some necessary and sufficient conditions for the 
t-fault locatability of a system, and develop an algorithm for the analysis. 

From previous discussions in Chapter 3, we observe that whenever the cardinality 
of the fault pattern is ≤ t, all individual faults should be exposed in order for the fault 
pattern to be locatable. When the cardinality is 2t − r for 0 < r < t, a minimum of r + 1 
faults should be exposed. In order to check whether the system is t-fault locatable, we have 
to consider all fault patterns of cardinality ≤ 2t. We prove a simpler sufficient condition for 
t-fault locatability so that we need to consider only fault patterns of cardinality at most t. 

THEOREM 4.2. A necessary and sufficient condition for t-fault locatability is that all 
individual faults are exposed in every fault pattern of cardinality ≤ t, and all fault patterns 
of cardinality t are distinguishable from each other. 

PROOF: We prove this theorem through a simple construction. We use rectangles 
to represent faults. The length of the rectangle corresponds to the cardinality of the fault 
pattern. When two fault patterns have common individual faults, the rectangles overlap 
in their positions. Consider two faults F_1 and F_2 whose cardinalities are ≤ t. 

ALGORITHM 2. 

Input to the algorithm are rows R_1 and R_2 whose complete 
disagreement is to be checked. 

(1) Check whether R_1 and R_2 have a disagreement. If not, output NO; stop. Otherwise 
    go to step 2. 

(2) Collapse errors and check whether the syndrome elements of either R_1 or R_2 are all 
    zeros or infinity. If so, output YES; stop. Else go to step 1. 

Case 1. 

Let the cardinality of F_1 be |F_1| = r and F_2 ⊂ F_1. The representation using rectangles is 
shown in Figure 4.3. By assumption, all individual faults in F_1 are exposed (since the 
cardinality is ≤ t), which implies that there is a 0-1 disagreement between regions A and B. 
But F_2 = B, which implies F_1 has a 0-1 disagreement with F_2. That is, F_1 and F_2 are 
distinguishable. 

Figure 4.3. Fault patterns of cardinality ≤ t. 

Case 2. 

Let |F_1| = t and |F_2| = r < t. Also F_1 ∩ F_2 = ∅. We augment F_2 with some faults 
contained in pattern F_1 such that the augmented F_2 (F'_2) has cardinality t. Now F_1 and F'_2 
are distinguishable by the assumption of the theorem. That is, region A has a 
0-1 disagreement with region C. (Note that since region B is common to both F_1 and F'_2, 
it will not contribute to the distinguishability of the faults.) But C = F_2 and hence F_1 and 
F_2 are distinguishable. 

Case 3. 

Let |F_1| < t, |F_2| < t and F_1 ∩ F_2 ≠ ∅, also |F_1 ∪ F_2| > t. (This is the most general 
case.) In the figure we construct a fault F'_1 by augmenting F_1 with some faults (C') from 
region C such that |F'_1| = t. Similarly F'_2 is constructed by adding a portion of A (A') to 
F_2. Now F'_1 and F'_2 are distinguishable, which means regions A − A' (the part of region A 
which contains the faults which are not contained in A') and C − C' have a 0-1 
disagreement. But A − A' ⊂ F_1, and C − C' ⊂ F_2. Therefore, F_1 and F_2 have a 0-1 disagreement 
and are hence distinguishable. 

Thus any two fault patterns of cardinality ≤ t are distinguishable and hence the 
sufficiency condition is proved. Proof for the necessary condition follows from the 
definition of t-fault locatability. □ 

The above results may be translated into the domain where we use the new model 
for the analysis of fault-tolerant systems. 

THEOREM 4.3. A given fault-tolerant system is t-fault locatable if and only if 
matrices PC and ^iPC, for i = 1, 2, ..., (t−1), have 1-0 disagreement, and ^tPC has 
0-1 disagreement with itself. 

PROOF: The condition that PC has 1-0 disagreement with ^iPC implies that all 
individual faults are exposed in fault patterns of cardinality ≤ t. Since ^tPC has 0-1 
disagreement with itself, all fault patterns of cardinality t are distinguishable. Hence the 
system is t-fault locatable by Theorem 4.2. □ 

EXAMPLE 4.4. Now we will present a hypothetical system in order to illustrate the 
various concepts we developed in the preceding sections. Consider a computing system 
consisting of 5 processors, each of which produces one data element. Every processor 
performs useful computation as well as 4 checking operations on the data produced 
by the remaining 4 processors. The PC matrix for such a system is shown in Figure 4.4. 
In this example we assume that the processors involved in checking operations are not 
fault secure. Complete analysis of the system shows that it is 4-fault detectable and 
2-fault locatable. It may be noted that this is the maximum detectability and locatability 
possible with a system having 5 processors. 

\[
PC =
\left[
\begin{array}{cccc|cccc|cccc|cccc|cccc}
\infty & \infty & \infty & \infty & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & \infty & \infty & \infty & \infty & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & \infty & \infty & \infty & \infty & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & \infty & \infty & \infty & \infty & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & \infty & \infty & \infty & \infty
\end{array}
\right]
\]

Figure 4.4. The PC matrix of the hypothetical system. 
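The PC matrix of Figure 4.4 can be rebuilt from Definitions 3.6 and 3.7. The Python sketch below is illustrative: it assumes that each of the 20 checks covers exactly one data element, that the checks are ordered first by host processor and then by the processor being checked, and that a term 0 times infinity is skipped in the product. Processors are numbered from 0 in the code. 

```python
import math

INF = math.inf
N = 5  # processors; processor i produces the single data element d_i

# Check (k, i): evaluated in processor k on the data element of processor i != k.
checks = [(k, i) for k in range(N) for i in range(N) if i != k]

# PD is N x (n + m), with n = 5 data elements followed by m = 5 pseudo-data elements.
PD = [[0] * (2 * N) for _ in range(N)]
for p in range(N):
    PD[p][p] = 1          # actual data element of processor p
    PD[p][N + p] = INF    # pseudo-data element of processor p

# DC is (n + m) x q, with q = 20 checks.
DC = [[0] * len(checks) for _ in range(2 * N)]
for col, (k, i) in enumerate(checks):
    DC[i][col] = 1        # the check covers data element d_i
    DC[N + k][col] = 1    # and the pseudo-data element of its host processor

# PC = PD x DC, skipping zero terms so that 0 * inf never appears.
PC = [[sum(x * y for x, y in zip(row, col) if x and y) for col in zip(*DC)]
      for row in PD]

for row in PC:
    print(row)
```

Each row contains four infinite entries, one for every check hosted by that processor, and a single 1 in each of the other four blocks. 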
4.5. Complexity of the Algorithms 

In the following, we provide a rigorous analysis of the complexities of the various 
algorithms proposed in the preceding sections. Assume that n = N × d, where n is the total 
number of data elements, N is the total number of processors, and d is the average number 
of data elements produced by each processor. The complexity of the algorithm based on the 
graph-theoretic model for determining the fault detectability (t faults) is exponential in the 
number of data elements n in the system. The algorithm has to check the detectability of 
all possible error combinations caused by every fault in F(t). 

Therefore, the complexity is 

\[
O\Big(\sum_{i=1}^{t} \binom{N}{i} \times 2^{id}\Big)
= O\Big(\sum_{i=1}^{t} \binom{N}{i} \times 2^{n}\Big)
= O(N^{t} \times 2^{n})
\]

which is polynomial in N for a given value of t ≪ N (which is usually the case) and 
exponential in n. 

Because of the error collapsing technique we use, the complexity of the algorithm 
based on our second model is linear in the number of data elements, as shown in the 
following. The complexity of the detectability algorithm is 

\[
O\Big(\sum_{i=1}^{t} \binom{N}{i} \times f(id)\Big),
\]

where f(id) is the number of steps taken in error collapsing, which is bounded by 
1 ≤ f(id) ≤ id. Further simplification yields the complexity 

\[
= O(N^{t} \times td) = O(N^{t-1} \times nt),
\]

which is a polynomial in N and a linear expression in n. 
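The gap between the two detectability bounds can be made concrete by evaluating the sums for small parameters. The Python sketch below is illustrative only; the function names and the parameter values are hypothetical, and the collapsing count uses the worst-case bound f(id) = id. 

```python
from math import comb


def enumeration_cost(N, d, t):
    """Sum over i of C(N, i) * 2^(i d): error combinations examined by enumeration."""
    return sum(comb(N, i) * 2 ** (i * d) for i in range(1, t + 1))


def collapsing_cost(N, d, t):
    """Sum over i of C(N, i) * (i d): collapsing steps under the bound f(id) <= id."""
    return sum(comb(N, i) * (i * d) for i in range(1, t + 1))


# Hypothetical system: 8 processors, 4 data elements each, 2-fault detectability.
print(enumeration_cost(8, 4, 2))  # 8 * 2^4 + 28 * 2^8 = 7296
print(collapsing_cost(8, 4, 2))   # 8 * 4 + 28 * 8 = 256
```

Doubling d to 8 multiplies the collapsing count by 2 but raises the dominant term of the enumeration count from 28 × 2^8 to 28 × 2^16. 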


The complexity of the algorithms for fault locatability analysis is higher than the 
complexity of the detectability algorithms. In the algorithms based on the graph-theoretic 
model [23], l-fault locatability of the system is determined by checking whether every 
individual fault is observable in the presence of fault patterns of cardinality ranging up to 
2l. Therefore, the complexity is 

\[
O\Bigl(\sum_{i=1}^{2l} \binom{N}{i} \times 2^{id} \times N \times 2^{d}\Bigr)
= O(N^{2l+1} \times 2^{n}).
\]

Due to the new sufficient condition established in Theorem 4.2 and due to the error 
collapsing technique, the complexity of the algorithm based on the matrix model is only 

\[
O\Bigl(\sum_{i=1}^{l} \binom{N}{i} \times f(id) \times N \times f(d)
+ \binom{N}{l}^{2} \times f(ld)\Bigr)
= O(N^{l-1} \times l \times n^{2} + N^{2l-1} \times ln).
\]

Here the second term corresponds to the complexity involved in comparing faults of 
cardinality l among themselves. It may be noted that the complexity is a polynomial in N 
and a quadratic in n. 


4.6. Examples for the Applications of the Model 

In this section, we present a few carefully selected examples to illustrate the 
application of the model to the analysis of a few realistic fault-tolerant architectures. 

EXAMPLE 4.5. Consider matrix multiplication using checksums on a mesh connected 
processor array. The fault tolerance scheme has been proposed by Huang and 
Abraham [47]. We will briefly describe the system below. 

Multiplication of two 3×3 matrices X and Y is done on a mesh connected processor 
array as shown in Figure 4.5(a). We assume that input data elements are broadcast on 
buses; the processors input the data from the buses. Under a processor failure, we 
assume that only the corresponding data element of the output matrix Z becomes 
erroneous. After the computation, the result Z resides in the local memories of the 
processors. Now the checking operations (six of them, three for rows and three for 
columns) are performed. 

Thus, we have 

DATA($P_i$) = $d_i$ for i = 1, 2, ..., 9. 

CHECK($d_1$, $d_2$, $d_3$) = $c_1$ 
CHECK($d_4$, $d_5$, $d_6$) = $c_2$ 
CHECK($d_7$, $d_8$, $d_9$) = $c_3$ 
CHECK($d_1$, $d_4$, $d_7$) = $c_4$ 
CHECK($d_2$, $d_5$, $d_8$) = $c_5$ 
CHECK($d_3$, $d_6$, $d_9$) = $c_6$. 

First, we do the analysis of the system assuming that the check evaluating 
processors, $P_3$, $P_6$, $P_7$, $P_8$ and $P_9$, are fault secure. In that case, the 
fundamental matrices of the system will be 

$PD = I_9$, where $I_9$ is the identity matrix of order 9, 

\[
DC = \begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1
\end{bmatrix},
\]

and $PC = PD \times DC = DC$, since $PD = I_9$. 
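
As an illustration, the fundamental matrices of this example can be generated directly from the CHECK definitions. The sketch below is not the thesis's C implementation; the helper names `build_dc` and `mat_mul` are hypothetical, and the check memberships are the row and column checksums listed above.

```python
# Row checks c1..c3 and column checks c4..c6 on the 3x3 result matrix;
# the integers are the indices of the data elements d1..d9.
CHECKS = {
    "c1": (1, 2, 3), "c2": (4, 5, 6), "c3": (7, 8, 9),
    "c4": (1, 4, 7), "c5": (2, 5, 8), "c6": (3, 6, 9),
}


def build_dc(checks, n_data):
    """DC[i][j] = 1 if data element d(i+1) takes part in check c(j+1)."""
    names = sorted(checks)
    return [[1 if d in checks[c] else 0 for c in names]
            for d in range(1, n_data + 1)]


def mat_mul(a, b):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


DC = build_dc(CHECKS, 9)
PD = [[1 if i == j else 0 for j in range(9)] for i in range(9)]  # identity
PC = mat_mul(PD, DC)

for row in PC:
    print(row)
```

Because every processor produces exactly one data element here, PD is the identity and the product leaves DC unchanged.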

During the analysis of the system for t-fault detectability, we see that ${}^{r}PC$ is 
completely detectable for r = 1, 2, 3. In ${}^{4}PC$ the row R corresponding to the sum 
of rows $P_1$, $P_2$, $P_4$ and $P_5$ of PC is [2 2 0 2 2 0]. Since the error detectability 
of checksum encoding is 1 (i.e., h = 1), none of the elements in R is valid. Therefore, R 
and hence ${}^{4}PC$ is not completely detectable. Then, by Theorem 4.1, the system is 
3-fault detectable. □ 
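
The detectability argument above can be replayed mechanically. The following is a minimal sketch under two assumptions: each processor produces a single data element (so no error collapsing is needed), and an entry of a summed row counts as a valid check only when it lies between 1 and h. The function names are illustrative.

```python
from itertools import combinations

# PC matrix of Example 4.5 with fault-secure checkers:
# rows P1..P9, columns c1..c6.
PC = [
    [1, 0, 0, 1, 0, 0], [1, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0], [0, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 0], [0, 0, 1, 0, 1, 0], [0, 0, 1, 0, 0, 1],
]


def row_sum(rows):
    """Column-wise sum of a collection of PC rows (one row of rPC)."""
    return [sum(col) for col in zip(*rows)]


def is_detected(row, h=1):
    """A fault is detected if at least one check sees between 1 and h errors."""
    return any(1 <= x <= h for x in row)


def fault_detectability(pc, h=1):
    """Largest t such that every fault pattern of size 1..t is detected."""
    t = 0
    for r in range(1, len(pc) + 1):
        if all(is_detected(row_sum(c), h) for c in combinations(pc, r)):
            t = r
        else:
            break
    return t


print(row_sum([PC[0], PC[1], PC[3], PC[4]]))  # rows P1, P2, P4, P5
print(fault_detectability(PC, h=1))
```

The exhaustive enumeration is only practical for small systems such as this one; it is meant to mirror the row-sum reasoning, not the complexity of the proposed algorithms.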

Next, we will consider the fault locating capability of the algorithm. Since the 
matrix PC has a 0-1 disagreement with itself, the algorithm is 1-fault locatable. In the 
analysis procedure we observe that PC does not have 1-0 disagreement with ${}^{2}PC$. 
For example, row $P_1$ of PC is [1 0 0 1 0 0] and this does not have a 1-0 disagreement 
with the row $P_2P_4$ (i.e., the sum of $P_2$ and $P_4$) of ${}^{2}PC$, which is equal to 
[1 1 0 1 1 0]. Hence the system is at most 2-fault locatable. As a next step we check 
whether all faults of cardinality 2 are distinguishable. For that, ${}^{2}PC$ should have 
0-1 disagreement with itself. One can observe that rows $P_1P_5$ = [1 1 0 1 1 0] and 
$P_2P_4$ = [1 1 0 1 1 0] of ${}^{2}PC$ do not have 0-1 disagreement. Hence ${}^{2}PC$ 
does not have 0-1 disagreement and the system is 1-fault locatable by Theorem 4.3. 
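
The pair of indistinguishable double faults quoted above can be found by a direct search for identical rows in the matrix of pairwise row sums. This sketch covers only that special case (two identical rows certainly cannot disagree); it does not implement the general disagreement tests, and the helper name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# PC matrix of Example 4.5 (fault-secure checkers), keyed by processor name.
PC = {
    "P1": [1, 0, 0, 1, 0, 0], "P2": [1, 0, 0, 0, 1, 0], "P3": [1, 0, 0, 0, 0, 1],
    "P4": [0, 1, 0, 1, 0, 0], "P5": [0, 1, 0, 0, 1, 0], "P6": [0, 1, 0, 0, 0, 1],
    "P7": [0, 0, 1, 1, 0, 0], "P8": [0, 0, 1, 0, 1, 0], "P9": [0, 0, 1, 0, 0, 1],
}


def identical_pair_rows(pc):
    """Group processor pairs whose summed rows coincide exactly."""
    groups = defaultdict(list)
    for a, b in combinations(sorted(pc), 2):
        key = tuple(x + y for x, y in zip(pc[a], pc[b]))
        groups[key].append((a, b))
    return {key: pairs for key, pairs in groups.items() if len(pairs) > 1}


for row, pairs in identical_pair_rows(PC).items():
    print(list(row), pairs)
```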

Figure 4.5. Processor arrays: (a) mesh connected array; (b) 4-node hypercube. 

Now we will analyze the same system with the assumption that the check evaluating 
processors are not fault secure. According to the definition of the fundamental matrices, 
we attach one pseudo-data element each to every check evaluating processor. It may be 
observed that the system is 0-fault detectable and 0-fault locatable. This is because the 
data element $d_9$ produced by processor $P_9$ is checked only by processor $P_9$. If 
there occurs a fault in $P_9$, all the checks done by $P_9$ are considered to be invalid and 
hence the error in $d_9$ cannot be detected. On the other hand, from the description of the 
algorithm it may be noticed that $d_9$ is not a useful data element as far as the original 
matrix multiplication is concerned. Therefore, an error in $d_9$ may be disregarded during 
the analysis, which effectively means that the element $d_9$ can be taken off from the 
model. As far as check evaluation is concerned, processor $P_9$ checks the correctness of 
data elements $d_3$, $d_6$, $d_7$, and $d_8$, which are also not useful data elements. Thus, 
the processor $P_9$ is not doing any useful job in terms of computation or check 
evaluation. Therefore, we remove $P_9$ from the model. As mentioned before, since the 
actual data elements produced by other check evaluating processors are also not useful 
for the original matrix multiplication, they are also discarded. Therefore, the final model 
should be such that each check evaluating processor has only one pseudo-data element 
associated with it. Accordingly, the fundamental matrices are 
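\[
PD = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & \infty & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & \infty & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & \infty & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \infty
\end{bmatrix},
\qquad
DC = \begin{bmatrix}
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}.
\]
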

The complete analysis shows that the system is 2-fault detectable and 1-fault 
locatable. 


EXAMPLE 4.6. In this example we analyze an algorithm for fault-tolerant matrix 
multiplication using checksum encoding done on a hypercube. We consider partitioned 
matrix multiplication done on a 4-node hypercube as shown in Figure 4.5(b). In the 
figure, the circles represent the hypercube nodes and the square represents the host 
processor. In the fault-tolerant algorithm suggested in [15], processor 1 checks the 
correctness of the data computed by processor 2 and sends a "pass" or "fail" signal to the 
host processor. At the same time processor 2 checks the data computed by processor 1 
and sends a signal to the host. Similarly, processors 3 and 4 also check each other and 
notify the host of the result. 


Here, even though every processor $P_i$, for i = 1, 2, 3, 4, produces data $d_i$, it is 
not necessary to include them in the PD matrix. This is because the check by the host 
processor is done only on the flag signals generated by the node processors. Let $e_i$ be 
the flag signal generated by $P_i$. Then, from the description of the algorithm, we have 
CHECK($e_1$, $e_2$) = $c_1$ 
CHECK($e_3$, $e_4$) = $c_2$. 

Since both the checks are resident in the host processor, if the host processor fails, all 
checks performed in the system will become invalid, and in that case it is 0-fault 
detectable. However, we are interested in the fault tolerance of the hypercube nodes, 
provided the host processor does all the checks correctly. Now the PC matrix is 

\[
PC = \begin{bmatrix}
1 & 0 \\
1 & 0 \\
0 & 1 \\
0 & 1
\end{bmatrix}.
\]

It can be seen that the system is 1-fault detectable and 0-fault locatable. However, if 
the mutual checking processor pairs are changed in the fault-tolerant algorithm, that is, if 
instead of 1, 2 and 3, 4, checking is done by pairs 1, 3 and 2, 4 and flag signals sent to the 
host processor, in effect we are adding two more checks given by 
CHECK($e_1$, $e_3$) = $c_3$ 
CHECK($e_2$, $e_4$) = $c_4$. 

The new PC matrix is 

\[
PC = \begin{bmatrix}
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1
\end{bmatrix}.
\]

Carrying out a similar analysis, we found that the algorithm is 3-fault detectable and 
1-fault locatable. This example was specifically chosen in order to illustrate the 
importance of selecting data elements for the PD matrix so that the analysis will be easier. 
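
The effect of the two extra checks on detectability can be reproduced with a brute-force sketch. It assumes one flag signal per node, h = 1, and a host processor that evaluates all checks correctly; `build_pc` and `fault_detectability` are illustrative names, not routines from the analysis program.

```python
from itertools import combinations


def build_pc(checks, n_proc):
    """PC[i][j] = 1 if the flag of processor i+1 takes part in check j+1."""
    return [[1 if p in members else 0 for members in checks]
            for p in range(1, n_proc + 1)]


def fault_detectability(pc, h=1):
    """Largest t such that every fault pattern of size 1..t has a valid check."""
    t = 0
    for r in range(1, len(pc) + 1):
        sums = ([sum(col) for col in zip(*rows)] for rows in combinations(pc, r))
        if all(any(1 <= x <= h for x in s) for s in sums):
            t = r
        else:
            break
    return t


original = [(1, 2), (3, 4)]              # checks c1, c2
extended = original + [(1, 3), (2, 4)]   # checks c1..c4

print(build_pc(extended, 4))
print(fault_detectability(build_pc(original, 4)))
print(fault_detectability(build_pc(extended, 4)))
```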

EXAMPLE 4.7. In this example we consider the Advanced Onboard Signal Proces- 
sor (AOSP) architecture [48]. The AOSP is an architectural concept for an advanced sig- 
nal processing computer that provides a fault-tolerant environment capable of supporting 
a wide range of signal processing applications. It is a loosely-coupled distributed mul- 
tiprocessor system in which a large number of identical processors known as Array Com- 
puting Elements (ACEs) communicate both data and control information via packetized 
messages over networks of high-speed buses. 

In order to achieve fault tolerance, one may incorporate some kind of system level 
fault diagnosis in the AOSP architecture. In this example we consider an AOSP with 
system level fault diagnosis. The encoding scheme to be used depends on the particular 
signal processing application for which AOSP is used. In the example we do not assume 
any particular computation or encoding scheme. The only assumption made is that the 
encoding scheme can detect one error (i.e., h = 1). 

The architecture of the AOSP is depicted in Figure 4.6. Due to the high density of 
the interconnection network, we have the luxury of having a fault-tolerant scheme in 
which any arbitrary processor can check the correctness of the computation done by any 
other processor. As an example we consider a scheme in which 
CHECK($d_?$, $d_4$, $d_?$) = $c_1$ 
CHECK($d_?$, $d_6$, $d_?$) = $c_2$ 
CHECK($d_4$, $d_2$, $d_9$) = $c_3$ 
CHECK($d_1$, $d_5$, $d_9$) = $c_4$ 
CHECK($d_?$, $d_5$, $d_?$) = $c_5$ 
CHECK($d_3$, $d_5$, $d_9$) = $c_6$. 

Figure 4.6. AOSP architecture. 

Suppose that the checks $c_1$ through $c_6$ are performed by processors $P_1$, $P_2$, 
$P_3$, $P_4$, $P_6$, and $P_7$, respectively. The analysis shows that the system is 
1-fault detectable and 0-fault locatable. 

EXAMPLE 4.8. Consider an 8 node hypercube (3-cube) performing partitioned vector 
computations. Each node computes three partitions (subdivisions) of the same vector. 
After the first set of computations, the partial results are rotated in the clockwise 
direction in the lower dimension (involving four processors each) for further iteration as 
shown in Figure 4.7. Each computed part is check evaluated by three neighboring 
processors in their order (i.e., the first neighbor checking the first part, the second 
neighbor checking the second part and so on). It can be determined that the fault 
detectability of the system is 1, 2 and 5 for an error detectability of 1, 2 and 3, respectively. 

4.7. An Alternative Approach to Check Invalidation 

In this section we reconsider the problem of invalidation of checks performed by 
faulty processors and develop an alternative method to handle the problem. 

In this approach the analysis procedure is divided into two steps: (1) primary 
analysis, and (2) secondary analysis. 

DEFINITION 4.15. Home Processor of a check is defined as the processor which 
performs that checking operation. 

The primary analysis consists of analyzing the system with the assumption that a 
check will not be invalidated if its home processor is faulty. In the secondary step, some 
additional information related to the correspondence between processors and checking 
operations performed by them is derived. With the help of the results obtained in the 
primary and secondary analyses, the actual fault tolerance capabilities of the system are 
determined. Even though this approach is more tedious than the one using pseudo-data 
elements, it has the advantage that it will induce an easier design technique: first design 
the primary system and then decide the home processors for various checking operations 
using the properties we derive in the next section. 

Figure 4.7. Data rotation in the hypercube. 

4.7.1. Secondary analysis 

Before going into the details of the procedure, we define the following parameters 
associated with a fault-tolerant multiprocessor system. 

DEFINITION 4.16. Self-Tested Set (STS) is defined as a set of processors such that, 
for at least one possible output error combination of these processors, every valid 
check done on these processors is resident in that set itself. 

Example 4.9. Consider a system described as 
DATA($P_1$) = {$d_1$, $d_2$, $d_3$} 
DATA($P_2$) = {$d_2$, $d_4$} 
DATA($P_3$) = {$d_5$} 
DATA($P_4$) = {$d_6$} 

CHECK($d_1$) = {$c_1$} 
CHECK($d_2$) = {$c_2$} 
CHECK($d_3$) = {$c_1$} 
CHECK($d_4$) = {$c_2$, $c_3$} 
CHECK($d_5$) = {$c_1$} 
CHECK($d_6$) = {$c_2$, $c_3$}. 

Now assume that check $c_1$ is resident in processor $P_2$, $c_2$ in $P_3$, and check 
$c_3$ in $P_4$. It can be observed that if processors $P_2$ and $P_3$ are faulty and if the 
corresponding error pattern is $d_2$, $d_5$, the valid check operations done on the output 
error pattern are $c_1$ and $c_2$. These two checks are resident among processors $P_2$ 
and $P_3$ and hence the set {$P_2$, $P_3$} is an STS. 

In further discussions, the cardinality of an STS is denoted by S. 

DEFINITION 4.17. An STS is called a minimal STS if removal of at least one proces- 
sor from that set will destroy the property of the STS. 

DEFINITION 4.18. $S_{min}$ is defined as the cardinality of the smallest minimal STS 
of the system. 

Let f be a fault pattern involving processors $P_1, P_2, \ldots, P_i$ and let 
$c_1, c_2, \ldots, c_j$ be the checks which give valid output (that is, detect the fault) 
when all the data elements produced by the faulty processors are erroneous. Three cases 
may arise as described below. 

Case 1. 

The checks $c_1, c_2, \ldots, c_j$ are resident in the processors of set f. In that case, 
set f is an STS. 

Case 2. 

Among the set of valid checks, some are resident in f and some are not resident in f. 
This does not guarantee that f is not an STS, since there may exist a particular error 
pattern for which f is an STS. To check that, enumerating all error combinations would 
be inefficient. Instead, we propose an error collapsing technique. (Distinguish this error 
collapsing technique from the one described in Section 3.) 

Case 3. 

All the valid checks are resident outside the set f. This is a special instance of 
Case 2, where the number of checks resident in set f is equal to zero. 


4.7.1.1. Algorithm to check whether f is an STS 

The procedure is given in ALGORITHM 3. 

ALGORITHM 3. 

(1) Collapse errors which are checked by those checks which are not resident in f. 

(2) If at least one processor in f is left with no output error at all, then f is not an STS. 
Otherwise go to step 3. 

(3) Find the new set of valid check elements. If all of them are in f then the situation is 
equivalent to Case 1, and f is an STS. Otherwise go to step 1. 

In order to simplify the implementation of the algorithm for determining whether a 
given set of processors forms an STS, we define one more model matrix called an H 
(home) matrix which gives the relationship between processors and checking operations 
resident in them. 

DEFINITION 4.19. The H matrix is an n × l matrix such that 

\[
H_{ij} = \begin{cases}
1 & \text{if } c_j \text{ is resident in } P_i \\
0 & \text{otherwise.}
\end{cases}
\]

DEFINITION 4.20. The matrix ${}^{r}H$ is defined as the one whose rows are formed by 
adding r different rows of matrix H, for all possible different combinations of r rows. 

It may be noted that the PC matrix and the H matrix will have the same dimensions, 
and hence ${}^{r}PC$ and ${}^{r}H$ will also have the same dimensions. 
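
Definition 4.20 amounts to summing every r-subset of rows. A minimal sketch, using as input the H matrix implied by the check residency stated in Example 4.9 ($c_1$ in $P_2$, $c_2$ in $P_3$, $c_3$ in $P_4$); the function name `r_matrix` and the choice of keying rows by index tuples are illustrative.

```python
from itertools import combinations


def r_matrix(m, r):
    """Return the rows of the r-fold matrix, keyed by the tuple of
    0-based row indices whose rows were added together."""
    return {
        combo: [sum(col) for col in zip(*(m[i] for i in combo))]
        for combo in combinations(range(len(m)), r)
    }


# H for Example 4.9: rows P1..P4, columns c1..c3.
H = [
    [0, 0, 0],  # P1 hosts no check
    [1, 0, 0],  # c1 is resident in P2
    [0, 1, 0],  # c2 is resident in P3
    [0, 0, 1],  # c3 is resident in P4
]

H2 = r_matrix(H, 2)
for combo, row in H2.items():
    print([f"P{i + 1}" for i in combo], row)
```

The same helper applies unchanged to PC, which is why the two r-fold matrices always share their dimensions.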

DEFINITION 4.21. (Covering) A valid check is said to be covered with respect to 
row R of ${}^{r}PC$ if row R of ${}^{r}H$ has a one in the corresponding position. 

DEFINITION 4.22. A row R of ${}^{r}PC$ is said to be covered if all the valid entries in 
that row are covered. 

DEFINITION 4.23. A row R of ${}^{r}PC$ is said to be completely covered if row R is 
covered for at least one possible error syndrome. 

The physical significance of covering is that if a check is covered with respect to R, 
the check operation is resident in the processor set R. If row R is covered, then all the 
valid check operations are resident in the processor set R itself when all the output data 
elements are erroneous. Complete covering implies that all the valid check operations 
are resident in set R itself for at least one possible output error combination. 

LEMMA 4.2. A processor set f is an STS if and only if it is completely covered. 

PROOF: Proof follows from the definitions of STS and complete covering. □ 

Now the previous algorithm to determine the STS nature of a processor set can be 
restated and implemented in terms of covering. The algorithm is called the STS ALGO- 
RITHM. 

STS ALGORITHM 
To check whether f is an STS. 

(1) Collapse errors which are not covered with respect to f. 

(2) If at least one row of the PD matrix is zero, then f is not completely covered, and 
hence not an STS. Otherwise go to step 3. 

(3) Check whether the new syndrome obtained after error collapsing is covered. If so, f 
is an STS; otherwise go to step 1. 

Example 4.10. In Example 4.9, given to illustrate the property of STS, the PC and 
the H matrices are 

\[
PC = \begin{bmatrix}
2 & 1 & 0 \\
0 & 2 & 1 \\
1 & 0 & 0 \\
0 & 1 & 1
\end{bmatrix},
\qquad
H = \begin{bmatrix}
0 & 0 & 0 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}.
\]

Consider the row [1 2 1] of ${}^{2}PC$ (sum of the 2nd and 3rd rows of PC); the 
corresponding row in ${}^{2}H$ is [1 1 0]. If h = 1, only the third check is uncovered. If 
we collapse the corresponding error, the new syndrome will be [1 1 0], which is covered. 
Therefore, the row [1 2 1] is completely covered, and hence, processors $P_2$ and $P_3$ 
form an STS. 
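
The collapsing loop of ALGORITHM 3 can be sketched at the level of data elements for the system of Example 4.9. The sketch assumes that a check is valid when it sees between 1 and h erroneous elements, and that "collapsing" an error means dropping it from the assumed error pattern; `valid_checks` and `is_sts` are hypothetical names.

```python
# System of Example 4.9: data produced by each processor, checks applied to
# each data element, and the home processor of each check.
DATA = {"P1": {"d1", "d2", "d3"}, "P2": {"d2", "d4"}, "P3": {"d5"}, "P4": {"d6"}}
CHECK = {"d1": {"c1"}, "d2": {"c2"}, "d3": {"c1"},
         "d4": {"c2", "c3"}, "d5": {"c1"}, "d6": {"c2", "c3"}}
HOME = {"c1": "P2", "c2": "P3", "c3": "P4"}


def valid_checks(errors, h):
    """Checks that see at least one and at most h erroneous data elements."""
    counts = {}
    for d in errors:
        for c in CHECK[d]:
            counts[c] = counts.get(c, 0) + 1
    return {c for c, k in counts.items() if 1 <= k <= h}


def is_sts(f, h=1):
    """Decide whether processor set f is a self-tested set by error collapsing."""
    errors = set().union(*(DATA[p] for p in f))  # start: all outputs erroneous
    while True:
        outside = {c for c in valid_checks(errors, h) if HOME[c] not in f}
        if not outside:
            return True          # every valid check is resident in f (Case 1)
        # Step 1: collapse the errors seen by valid checks resident outside f.
        errors = {d for d in errors if not (CHECK[d] & outside)}
        # Step 2: a processor left without any output error rules f out.
        if any(not (DATA[p] & errors) for p in f):
            return False


print(is_sts({"P2", "P3"}))
print(is_sts({"P4"}))
```

For {$P_2$, $P_3$} the loop collapses $d_4$ (seen by $c_3$, which is resident in $P_4$) and stops with the error pattern $d_2$, $d_5$, matching the pattern quoted in Example 4.9.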

THEOREM 4.4. If t is the fault detectability of a system obtained after primary 
analysis, any fault pattern of cardinality ≤ t is undetectable if and only if the set of 
processors involved in the fault is an STS. 

PROOF: 

Proof for the sufficiency condition follows from the definition of STS. 

Proof for the necessary condition (by contradiction): Let the fault pattern F be 
undetectable while F is not an STS. By the definition of an STS, this implies that set F 
is such that for every output error pattern, there exists at least one valid check operation 
which is resident in a processor outside the set F and hence the fault is detectable. □ 

THEOREM 4.5. The actual fault detectability of the system is 
$t_{act} = \min(t, S_{min} - 1)$, where t is the value of fault detectability obtained after 
primary analysis. 

PROOF: Proof of the theorem follows from Lemma 4.2. □ 

4.7.2. Analysis to determine actual locatability 

The actual locatability $l_{act}$ of a system can be determined by a similar type of 
analysis. Instead of STS we define another type of processor set called 
Self-Locating Set (SLS). 

DEFINITION 4.24. Self-Locating Set (SLS) is a set of processors for which at least one 
output error combination exists such that all the valid check operations, which 
distinguish faults in these processors from all other faults of cardinality less than or 
equal to the cardinality of the processor set, are resident in the given processor set itself. 

In the following, we formulate an algorithm to determine whether the given set is an 
SLS. 

Let $f = \{P_1, P_2, \ldots, P_r\}$. That is, $f \in {}^{r}PC$. Now to check whether f is 
an SLS we use the SLS ALGORITHM. 

SLS ALGORITHM 

For all rows $P_i$ in the PC matrix which are also elements of the set f: 

(1) Find all the checks which cause complete 1-0 disagreement with the row $f - P_i$ 
(i.e., the set f excluding $P_i$) of ${}^{r}PC$. 

(2) If all checks are covered with respect to f, then f is an SLS. Otherwise, go to step 3. 

(3) Find all the checks which are not covered, and collapse the corresponding errors. If 
at any stage row $P_i$ of PC becomes zero, then f is not an SLS. Otherwise go to step 2. 

THEOREM 4.6. The actual fault locatability $l_{act}$ of a system is 

\[
l_{act} = \min(l, SL_{min} - 1),
\]

where l is the fault locatability obtained in the primary analysis and $SL_{min}$ is the 
cardinality of the smallest minimal SLS. 

PROOF: Proof of the theorem is similar to the proof of Theorem 4.4. 


EXAMPLE 4.11. In this example we present a complete analysis of a system using 
the secondary analysis we developed in this section. We consider the fault-tolerant AOSP 
architecture illustrated in Example 4.7. The check operations are distributed among the 
processors in such a way that the H matrix is 

\[
H = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0
\end{bmatrix}.
\]

From the primary analysis of the system, using the fundamental matrices, we have 
already found that the system is 3-fault detectable and single fault locatable; i.e., t = 3 
and l = 1. 

Secondary analysis: It may be observed that none of the rows of PC are covered, 
whereas the row corresponding to the sum of rows $P_6$ and $P_8$ in ${}^{2}PC$ is 
[1 1 0 0 0 0] and is covered. (Note that the corresponding row in ${}^{2}H$ is also 
[1 1 0 0 0 0].) Thus, $S_{min} = 2$ and by Theorem 4.5, the actual fault detectability 
is min(3, 1) = 1. 

According to the check distributions, check $c_1$ is the one and only check which 
distinguishes the faults in processors $P_6$ and $P_8$. However, this check is covered by 
the 8th row of the H matrix and hence $SL_{min} = 1$. The actual fault locatability is 
$l_{act}$ = min(1, 0) = 0. 
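
The arithmetic of this example can be replayed in a few lines. The sketch below takes the H matrix as read from the example (an assumption, since the printed matrix is hard to read), tests the covering condition of Definitions 4.21 and 4.22 for the quoted row with h = 1, and applies the min formulas of Theorems 4.5 and 4.6. The helper names are illustrative.

```python
# H matrix of Example 4.11 as read from the text: rows P1..P9, columns c1..c6.
H = [
    [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0],
]


def add_rows(m, indices):
    """Column-wise sum of the selected rows (0-based indices)."""
    return [sum(col) for col in zip(*(m[i] for i in indices))]


def is_covered(row_pc, row_h, h=1):
    """Every valid entry of the PC row has a one in the matching H row."""
    return all(hh >= 1 for x, hh in zip(row_pc, row_h) if 1 <= x <= h)


def actual_capability(primary, smallest_set):
    """min(primary value, cardinality of the smallest minimal set - 1)."""
    return min(primary, smallest_set - 1)


h2_row = add_rows(H, [5, 7])                     # rows P6 and P8
print(h2_row)
print(is_covered([1, 1, 0, 0, 0, 0], h2_row))    # quoted row of 2PC
print(actual_capability(3, 2))                   # t = 3, S_min = 2
print(actual_capability(1, 1))                   # l = 1, SL_min = 1
```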

4.8. Further Extensions 

Applications of the matrix-based model for the diagnosis of faults in a fault-tolerant 
system were further investigated by Vinnakota and Jha in [49]. 

The diagnosis problem is defined as: given a syndrome S, determine which fault 
produced that syndrome. The authors used the matrix-based model for locating and 
identifying a fault, once its presence is detected. They observe that those faults which do 
not have 1-0 disagreement with every row of ${}^{l}PC_S$ are not elements of the 
candidate fault pattern. This is borne out by the fact that for every locatable fault F, every 
processor in F is checked by at least one valid check that does not check any other 
processor in F. Based on this observation, an algorithm has been proposed. However, this 
algorithm tackles the problem in a roundabout fashion. Therefore, we suggest a 
straightforward algorithm for identifying the fault present. 

DEFINITION 4.25. The matrix $PC_S$ is obtained by deleting all the columns of the PC 
matrix corresponding to the 0's in the syndrome S and then deleting all the resulting zero 
rows. 

DEFINITION 4.26. $PC_{Sr}$ is the matrix obtained by deleting the rows in $PC_S$ which 
do not have a complete 1-0 disagreement with ${}^{l}PC_S$. 
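
Definition 4.25 is a pure filtering step and can be sketched directly. The example input is the four-check PC matrix of Example 4.6 with an assumed syndrome in which only $c_1$ and $c_3$ report an error; the function name `pc_s` is illustrative, and the disagreement test needed for $PC_{Sr}$ is not attempted here.

```python
def pc_s(pc, syndrome):
    """Drop the columns where the syndrome is 0, then drop all-zero rows."""
    keep = [j for j, s in enumerate(syndrome) if s != 0]
    reduced = {p: [row[j] for j in keep] for p, row in pc.items()}
    return {p: row for p, row in reduced.items() if any(row)}


# PC matrix of Example 4.6 after adding checks c3 and c4.
PC = {
    "P1": [1, 0, 1, 0],
    "P2": [1, 0, 0, 1],
    "P3": [0, 1, 1, 0],
    "P4": [0, 1, 0, 1],
}

S = [1, 0, 1, 0]  # assumed syndrome: checks c1 and c3 flag an error
print(pc_s(PC, S))
```

Processor $P_4$ disappears because neither of the flagged checks involves it, which is the sense in which the reduced matrix keeps only candidate faulty processors.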

4.8.1. Description of the diagnostic algorithm 

DIAGNOSTIC ALGORITHM 

(1) Compute $PC_S$ for the given syndrome S. 

(2) Compute ${}^{l}PC_S$. 

(3) Compute $PC_{Sr}$. 

(4) If the number of rows in $PC_{Sr}$ is ≤ l, then the processors corresponding to the 
rows of $PC_{Sr}$ represent the components of the required fault pattern. Otherwise go to 
step 5. 

(5) Compute ${}^{l}PC_{Sr}$. The row which corresponds to the given syndrome 
represents the fault to be diagnosed. 

The motivation behind performing the steps up to step 3 is obvious from the 
definitions of the matrices $PC_S$ and $PC_{Sr}$. In $PC_{Sr}$ we have included only 
checks which are flagging a valid "error" output (that is, the output is 1). Therefore, if the 
number of rows in $PC_{Sr}$ is ≤ l, there cannot be any row representing a nonfaulty 
processor. If the number of rows in $PC_{Sr}$ is > l, the size of the fault present cannot be 
larger than l by the definition of l-fault locatability. However, we have to check whether 
the fault present is of cardinality < l. In the following discussion we find that that is also 
not possible. 

If possible, let the fault pattern F be of size less than l. Again, by the definition of 
$PC_{Sr}$, F is checked by all the checks in $PC_{Sr}$. Then for any processor p in 
$PC_{Sr}$ which is fault-free, the set of checks that check that processor are checking the 
processors in the fault pattern F, which are also fewer than l in number. This contradicts 
the conditions for l-fault locatability. Therefore, the only possibility is that the fault 
pattern that we are looking for is of cardinality l. 

Now to determine exactly which fault pattern has occurred, we have to consider all 
the l-combinations of the faults represented by the rows of $PC_{Sr}$. Then, by checking 
the equality of the rows of ${}^{l}PC_{Sr}$ with the given syndrome, we can conclude 
which fault has been present in the system. 

It should be noted that the diagnosis procedure described here needs an a priori 
knowledge of the fault locatability of the system; one can determine the locatability of 
the system using Algorithm 2 given earlier in this chapter. In fact, the diagnosis pro- 
cedure could be integrated with the procedure for determining the fault locatability of the 
system. 

4.9. Results and Conclusions 

The concept of concurrent fault diagnosis involves the application of checking 
operations on the data generated by multiprocessor systems to obtain reliable results. We 
have proposed a new matrix-based model for analyzing the fault-detecting and -locating 
capability of such systems. A uniform framework was constructed in which faults in the 
processors performing useful computations can be treated along with faults in the proces- 
sors evaluating the checks. The necessary and sufficient conditions for fault detection 
and location were derived. The algorithms based on these derived conditions are less 
complex than the existing algorithms because of the error collapsing technique we 
introduced and because of the simpler sufficiency conditions. The algorithms were used 
to determine the fault detectability and locatability of some realistic systems. 

These algorithms have been implemented in C under UNIX on SUN workstations. 
The program takes the algorithm/system description as its input. The program analyzes 
the system for its fault detectability first and then uses that result to evaluate the 
locatability. Some typical run times are given in Table 4.1 for some of the example 
systems discussed in the preceding sections. In the table, S1 through S4 are the systems 
described in Examples 4.6, 4.5, 4.7, and 4.8, respectively. 


Table 4.1. Typical run times for the fault diagnosis program 

Example   #P (=N)   #d (=n)   h   t   l   Run time (sec) 
S1        4         4         1   3   1   3.1 
S2        8         8         1   2   1   6.0 
S3        9         9         1   1   0   3.6 
S3        9         9         2   5   1   118.3 
S4        8         24        3   2   0   8.9 

#P is the number of processors 
#d is the number of data elements 
CHAPTER 5. 

DESIGN OF ABFT SYSTEMS 


5.1. Introduction 

There are two ways to approach the problem of designing algorithm-based fault- 
tolerant systems: (1) given a non-fault-tolerant system, determine an efficient distribu- 
tion of checks among the output data elements so that the system has the desired amount 
of fault tolerance; (2) given a fault-tolerant algorithm, synthesize an architecture so as to 
maximize quantities such as the fault detectability and locatability of the system. Both 
the approaches have their advantages and disadvantages. In the first approach, the fault- 
tolerant design is constrained by the fixed, non-fault-tolerant architecture. In the second 
approach, performance may be sacrificed in the process of achieving high fault tolerance. 

Since most of the commercially available multiprocessors are built to maximize 
their performance, usually they do not carry any fault tolerance capabilities as such. 
According to the requirements of the application, it is up to the fault tolerance designer to 
make the system fault-tolerant. Therefore, in practice, the designer is forced to adopt the 
first approach. This is the philosophy followed by previous researchers also [50,51]. 
Since the first approach is immediately applicable to existing architectures, we also look 
at the problem from the first point of view. However, our methodology is different from 
the existing ones. 
The design of fault-tolerant multiprocessors that have processors producing more 
than one data element by modifying the non-fault-tolerant architectures was considered 
to be an intractable problem [50]. In that study, in order to design a system with the 
required fault tolerance capabilities, first the cardinality of the largest error pattern 
generated by all possible faults is determined and the system is designed to detect and 
locate that number of errors. In this thesis, we propose a direct scheme to design such 
systems which eventually results in a smaller number of checks compared to the previous 
methods. The design procedure is illustrated with examples. A comparison between the 
existing schemes and the new scheme is done with respect to the number of checks 
required for each scheme. 

5.2. Previous Work 

Previous studies done by Banerjee and Abraham [50] and then by Rosenkrantz and 
Ravi [51] were geared towards computing the bounds on the number of checks required 
to be attached to a given non-fault-tolerant architecture in order to make it fault-tolerant 
to the desired degree. In the first study, bounds were derived for the number of checks 
required for the desired amount of fault detectability and locatability. The bounds for the 
detectability were later enhanced in the second study. 

In both cases, bounds were developed through algorithmic procedures to construct 
such a system. The design of systems that have processors producing multiple data 
elements was considered to be an intractable problem. As a solution, they suggested an 
indirect approach. In order to design a system for t-fault detectability, first the size of the 
largest error pattern (let it be equal to s) for all fault patterns was determined and the 
system was designed for an error detectability of s. As an example, consider a 
non-fault-tolerant multiprocessor system consisting of 4 processors. Processor $P_1$ 
produces 3 data elements, $P_2$ produces 2 data elements, and $P_3$ and $P_4$ produce 
one data element each. If the system is to be designed to be 2-fault detectable, first all the 
fault patterns of cardinality two are enumerated and their corresponding error patterns are 
determined. Then, the size of the largest error pattern is computed; in the example it is 5. 
(Note that the size of the largest error pattern caused by faults of cardinality < t is always 
less than or equal to the size of the largest error pattern caused by faults of cardinality t. 
Therefore, one needs to consider only fault patterns of cardinality t in order to determine 
the size of the largest error pattern.) Now the system is designed to have an error 
detectability of 5. 
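
The value s = 5 in this example can be checked by enumeration. The sketch assumes that the processors produce disjoint sets of data elements, so that the size of an error pattern is simply the sum of the data counts of the faulty processors; the function name is illustrative.

```python
from itertools import combinations


def largest_error_pattern(data_counts, t):
    """Size of the largest error pattern over all fault patterns of size t,
    assuming the processors' data elements do not overlap."""
    return max(sum(pattern) for pattern in combinations(data_counts, t))


# P1 produces 3 data elements, P2 produces 2, P3 and P4 produce 1 each.
counts = [3, 2, 1, 1]
print(largest_error_pattern(counts, 2))
```

Raising the load of a single processor, for instance from 3 to 6 data elements, raises s from 5 to 8 even though the target fault detectability is unchanged.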

The lower and upper bounds for the number of checks required were calculated in 
terms of g, h, and s. In the following subsection we give some sample bounds derived in 
[50] and [51]. 

5.2.1. A few sample bounds 

As examples, we provide the bounds derived for 2-error detectability and 
3-error detectability. It was shown in [51] that at least $2n/(g+1)$ checks are necessary to 
detect 2 errors. Rosenkrantz and Ravi showed that $\lceil 2n/(g+1) \rceil$ checks are 
sufficient for detecting 2 errors. For 3-error detection also, $\lceil 2n/(g+1) \rceil$ is a 
trivial lower bound. The upper bound for this case derived in [50] was $qn/(q+g-1)$, 
where $q = \lceil (3g+1)/2 \rceil$. This bound was later improved in [51] to a higher value 
$\lceil (2n - \lfloor n/g \rfloor)/g \rceil + 1$. 
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5.2.2. Limitations 

With modern VLSI technology, individual nodes of multiprocessor systems are
capable of high computing power. Every processor in the system may be computing
a large volume of data, which in turn means that the error patterns produced
by a faulty processor may be large. Since the designs are done to meet the
error detectability criterion, this leads to designs of high complexity.

The methodology is efficient only when most of the t-fault patterns produce error
patterns of cardinality s. Conversely, if only a small number of fault patterns produce
error patterns of cardinality s, the design procedure will use a larger number of checks
than necessary. The approach also lacks flexibility with respect to variations in the amount
of computation performed by a processor. For example, if a system were designed for an error
detectability of s and one of the processors is later assigned a bigger load, computing
more data elements, then in this method the whole system has to be redesigned for the new
error detectability (to maintain the original value of the fault detectability).
Unfortunately, both these scenarios exist in real life. It was reported in [48] that in the AOSP
architecture, every computing node in the structure can support a wide variety of signal 
processing computations and often, the amount of computation performed by various 
nodes is not the same. If we incorporate system level diagnosis in this case, inefficient 
designs will result as mentioned in the beginning of the paragraph. 

In [50] and [51] the problem of minimizing the number of (g, h) checks was
transformed into a problem of constructing a bipartite graph where the number of output
nodes is minimized subject to the constraints of t-fault detection. Instead of using (g, h)
checks, checks of type (g', 1) were used, where $g' = \lfloor g/h \rfloor$. After constructing a graph
subject to the modified constraints, groups of h output nodes are merged together. The new
merged graph satisfies the constraints of (g, h) checks. The limitation of this approach is
that even though the merged graph preserves the fault detectability of the original graph,
the locatability may not be preserved. In other words, the design cannot handle
detectability and locatability constraints simultaneously.

5.3. A New Approach for the Design of FTMP Systems 

In this thesis, we propose a straightforward design procedure which needs a smaller 
number of checks than the previous techniques especially when the computation is 
nonuniformly distributed among the processing nodes. In our approach the system is 
designed directly for meeting the fault constraints rather than error constraints. Also the 
methodology can handle both detectability and locatability issues simultaneously. The 
use of the matrix-based model allows the use of simple vector space techniques to
identify redundant checks. The flexibility in handling varying amounts of
computation performed by individual processor nodes is another advantage of our approach.

5.3.1. Problem definition

Using the matrix-based model parameters, we define the design problem as follows: 
Given the PD matrix, find a DC matrix so that the corresponding PC matrix has the 
required fault diagnosing capabilities. 

Since PC = PD*DC, the design involves finding two variables (actually two
matrices) from a single equation. Therefore, the solution is not unique. The selection of
a particular solution should optimize the number of checks required, and the number of
errors detectable and correctable by each check (assuming that the cost of each check
increases proportionally to the number of errors it can detect and correct). Our approach
to the problem consists of the following steps:

(1) Design a DC matrix for PD = $I_m$ and h = 1, such that the system has
t-fault detectability and l-fault locatability. Here $I_m$ is the identity matrix of order m,
where m is the number of processors in the system to be designed. We call this
system the unit system of the actual fault-tolerant system to be designed.

(2) Modify the DC matrix of the unit system according to the given PD matrix in order 
to obtain the DC matrix of the actual system. 

In the unit system, since the PD matrix is the unit matrix, every processor is produc- 
ing only one data element. Therefore, the cardinality of the fault patterns will be the 
same as the cardinality of the resulting error patterns. As mentioned before, in such a 
situation, the techniques proposed in [50,51] are efficient and can be used for the design 
of the unit system. Designs are already available for various values of fault detectabili- 
ties and locatabilities. These designs of the unit systems can be used as a template. Now 
the actual design consists of modifying these template designs to obtain the actual sys- 
tem. 

DEFINITION 5.1. The Product System of a given non-fault-tolerant system and the
corresponding fault-tolerant unit system is defined as the system obtained by connecting
every data element affected by processor $P_i$ in the non-fault-tolerant system to every
check element in the unit system which checks the output of processor $P_i$. □

The construction of the product system is illustrated in the following example.

EXAMPLE 5.1. Let us consider a multiprocessor system with four processors. The 
first processor produces 4 data elements, the second one 2, and the third and the fourth 
processor produce only one data element each. A unit system is designed for
3-fault detectability and 1-fault locatability. Construction of the product system is given
in Figure 5.1. □

THEOREM 5.1. If the unit system is t-fault detectable and l-fault locatable, for t and
l > 0, then the product system is t-fault detectable and l-fault locatable if and only if

$h \geq \max [num(P_i)]$.

Here h is the error detectability of the checks in the product system and $num(P_i)$ is the
number of data elements affected by processor $P_i$.

PROOF: Proof for the necessary condition (by contradiction): Let
$h < \max [num(P_i)]$, and let the product system be t-fault detectable and l-fault locatable. Let
processor $P_j$ be such that $num(P_j) = \max [num(P_i)]$, where the maximum is computed for
i = 1, 2, 3, ..., m. If $P_j$ fails in such a way that all the data elements produced by $P_j$
become erroneous, then all the checks done on $P_j$ in the product system will become
invalid. Therefore, such a fault in $P_j$ will not be detected and the system is
0-fault detectable, which is a contradiction.

Proof for the sufficiency condition: Since the unit system is designed for h = 1, for
any fault pattern of cardinality $\leq t$, there exists at least one check in the unit system which
checks only one processor in that group of processors which are faulty. (In [31] this is
referred to as the 1-neighbor intersection property.) In the product system also, since
$h \geq \max [num(P_i)]$, this check will fail and hence the system has the same detectability
and locatability as the unit system. Locatability of the product system can be argued in a
similar fashion. □

[Figure 5.1. Construction of a product system. Panels: non-fault-tolerant system, unit system, and (c) product system.]

5.3.2. Construction of the actual system

Construction of the actual fault-tolerant system with given (g, h) checks is done by 
splitting each check in the product system into one or more checks such that every check 
in the resulting system has at most h data elements from each processor checked by those 
checks. Now the new system has the same detectability and locatability as the product 
system. This is because, whenever a check in the product system fails, at least one of the 
checks formed by splitting that check node will fail in the actual system. Actually, the 
detectability and locatability of the new system may be higher than the product system. 
However, we are interested only in the fact that the detectability and locatability of the 
final system are at least equal to that of the product system. 

It should be noted that in the procedure described above, instead of combining h 
checks in the unit system to form a system having checks of error detectability h, we 
attach h data elements from every processor to that check. This approach will preserve 
the fault detectability and locatability of the unit system even after converting it into the 
final system. However, when most of the processors produce fewer than h data
elements, the design may not be efficient. In those cases, the unit system itself may be
designed by assuming that the checks have error detectability h. Correspondingly, while
constructing the final system from the product system, the checks should be split in such 
a way that every check receives at most one data element from the processors which are 
being checked by that check. 

Another point of interest is the assignment of checks to the processors, that is,
which processor performs which check. Once the assignment is decided for the
unit system, the same assignment should be followed in the product system also. When
checks in the product system (parent checks) are split into component checks, all the
component checks are assigned to the same processor which was hosting the parent check
in the product system.

In the following, we present the design procedure as an algorithm, in terms of the 
matrix-based model parameters. The resulting DC matrix (and the corresponding PC 
matrix) has the required fault diagnosing capabilities. The DC matrix can be further 
simplified by deleting some of the redundant columns as follows. 

DEFINITION 5.2. Column $C_i$ (i.e., check $C_i$) is said to be covered by one or more
columns if and only if $C_i$ can be written as a linear combination (with coefficient of
multiplication equal to 1) of those columns. □


DESIGN ALGORITHM 

(1) Construct a DC matrix for the unit system (we call it the unit DC matrix), so that the
unit system is t-fault detectable and l-fault locatable.

(2) The DC matrix for the product system (called the product DC matrix) is constructed
by expanding the columns of the unit DC matrix vertically. The row corresponding
to processor $P_i$ in the unit DC matrix is replicated $num(P_i)$ times.

(3) The DC matrix of the product system is partitioned into blocks of rows, such that
the i-th block contains data elements produced by processor $P_i$.

(4) Each column in the product DC matrix is split into a minimum number of columns
so that every column has at most h 1's in every block.
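
Steps (2) to (4) of the design algorithm can be sketched as follows. This is a minimal sketch under our own assumptions: the DC matrix is stored as a list of rows (data elements) by columns (checks), the function names `product_dc` and `split_columns` are illustrative, and the 4-processor unit DC matrix in the usage example is hypothetical (it is not the matrix of Figure 5.2). The splitting rule used here, filling component checks with h data elements per block in order, is one way to reach the minimum number of columns.

```python
def product_dc(unit_dc, num):
    """Steps (2) and (3): replicate row i of the unit DC matrix num[i] times
    and record which rows form the block of each processor."""
    rows, blocks = [], []
    for unit_row, count in zip(unit_dc, num):
        start = len(rows)
        rows.extend(list(unit_row) for _ in range(count))
        blocks.append(range(start, start + count))
    return rows, blocks


def split_columns(dc, blocks, h):
    """Step (4): split every column into the fewest columns that have
    at most h ones in every block."""
    n_rows = len(dc)
    out_cols = []
    for j in range(len(dc[0])):
        pieces = {}  # piece index -> set of rows assigned to that piece
        for block in blocks:
            ones = [r for r in block if dc[r][j] == 1]
            for pos, r in enumerate(ones):
                # The first h ones of a block go to piece 0, the next h to piece 1, ...
                pieces.setdefault(pos // h, set()).add(r)
        for k in sorted(pieces):
            out_cols.append([1 if r in pieces[k] else 0 for r in range(n_rows)])
    # Convert the list of columns back to a list of rows.
    return [list(row) for row in zip(*out_cols)]


# Hypothetical unit DC matrix: 4 processors (rows) and 3 checks (columns).
unit_dc = [
    [1, 1, 0],  # P1
    [1, 0, 1],  # P2
    [0, 1, 1],  # P3
    [1, 1, 1],  # P4
]
rows, blocks = product_dc(unit_dc, [4, 2, 1, 1])
final_dc = split_columns(rows, blocks, h=2)
print(len(final_dc), len(final_dc[0]))  # 8 5
```

In this hypothetical instance the first two unit checks are each split in two because processor P1 contributes 4 data elements and h = 2, giving 5 columns over 8 data elements.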

LEMMA 5.1. If $C_i$ is covered by $C_{j_1}, C_{j_2}, \ldots, C_{j_k}$, then $C_i$ is a redundant check and
can be deleted.

PROOF: Since $C_i$ can be obtained as a linear combination of $C_{j_1}, C_{j_2}, \ldots, C_{j_k}$,
whenever $C_i$ fails, at least one of the checks among $C_{j_1}, C_{j_2}, \ldots, C_{j_k}$ will fail. Therefore, if the
system has checks $C_{j_1}, C_{j_2}, \ldots, C_{j_k}$, then check $C_i$ is redundant and hence can be deleted. □

COROLLARY 5.1. Every column is covered by another identical column, if such a 
column exists. 
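
The covering test of Definition 5.2 and the deletion allowed by Lemma 5.1 can be illustrated with a brute-force sketch. We assume here that "linear combination with coefficients equal to 1" means an element-wise integer sum of a subset of the other columns; the function names are ours, and the subset search is exponential, so this is meant only for small illustrative matrices.

```python
from itertools import combinations


def is_covered(target, others):
    """True if `target` equals the element-wise sum of some non-empty
    subset of the columns in `others` (all coefficients equal to 1)."""
    for size in range(1, len(others) + 1):
        for subset in combinations(others, size):
            if [sum(vals) for vals in zip(*subset)] == list(target):
                return True
    return False


def delete_covered_columns(columns):
    """Delete covered columns one at a time until no column is covered.

    Deleting one column per pass matters: of two identical columns,
    each covers the other, but only one of them may be removed.
    """
    cols = [list(c) for c in columns]
    changed = True
    while changed:
        changed = False
        for i, col in enumerate(cols):
            if is_covered(col, cols[:i] + cols[i + 1:]):
                del cols[i]
                changed = True
                break
    return cols


# Columns of a small DC matrix (each inner list is one check).
checks = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 1, 1]]
print(delete_covered_columns(checks))  # [[0, 0, 1, 1], [1, 1, 0, 0]]
```

In the usage example the first column is removed because it is identical to the third (Corollary 5.1), and the last column is removed because it is the sum of the two remaining ones.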

Once the DC matrix is determined using the design algorithm, the DC matrix is 
further simplified by deleting all columns which are covered by some other columns. 
The procedure for modifying the product system to obtain the actual system is illustrated 
in the following example. 

EXAMPLE 5.2. We consider the same multiprocessor system that we introduced in
the previous example. We are to design a fault-tolerant system consisting of all
these processors such that the system will be 3-fault detectable and 1-fault locatable. In
Example 5.1, we have already seen how to construct the product system. The rest of the
algorithmic procedure is shown in Figure 5.2. The graphical representations of the
systems are shown along with their matrix representations. It can be observed in Figure 5.2
(b) that the DC matrix has two identical columns corresponding to checks $C_{12}$ and $C_{22}$.
Either one of those checks can be deleted from the system so that the final system has
only 4 checks. □

[Figure 5.2. Design of the final system from the product system. Panel title: fault-tolerant system for h = 2 and its DC matrix.]

5.3.3. Comparison with previous schemes

In order to make a comparison between the newly proposed scheme and the existing 
ones, we consider the following example. The comparison is done with respect to the 
number of checks required in the design. 

EXAMPLE 5.3. Suppose a system involving 500 processors has to be designed for
3-fault detectability and 1-fault locatability. The checks available are of type (5,1) (i.e., g
= 5, and h = 1). It is also known that 10 of those 500 processors produce 2 data elements
each.

Number of checks required as per the bounds given in [50] : 408
Number of checks required as per the bounds given in [51] : 363
Number of checks required for our scheme                  : 200 □

5.4. Conclusions 

A systematic and straightforward design methodology is proposed for the design of
fault-tolerant multiprocessor systems where individual processors may produce multiple
data elements. Our approach
is to transform the non-fault-tolerant system directly to satisfy the fault detectability and 
locatability constraints to obtain a fault-tolerant system. The new scheme is more 
efficient than the previous schemes with respect to the number of checks used in the 
overall system. Examples are provided for the illustration of the design methodology and 
for the comparison of various schemes.

5.4.1. An alternative approach

For completeness of the thesis we briefly describe an alternative approach for
designing a fault-tolerant system by mapping a fault-tolerant algorithm on a suitable
multiprocessor architecture which will minimize the overhead [52]. To that end, a
dependence graph-based approach has been suggested by Vinnakota and Jha in [52].

In the first stage of the design process, a particular encoding scheme is selected to 
meet the fault tolerance specifications. In the second stage, an optimal architecture to 
implement the scheme is chosen using dependence graphs. 

Dependence graphs are graph-theoretic representations of algorithms [53]. After 
the first stage of the design, the encoded algorithm is represented as a dependence graph. 
This graph is then projected in several directions to obtain different realizations of
architectures, among which the one with the optimal features is chosen. It was demonstrated
that not all architectures are suitable for a particular ABFT scheme. In this study the
authors claim that their approach is architecture independent. However, most of the
cost-effective fault tolerance algorithms known until now are architecture specific.
Therefore, the selection of a particular algorithm dictates the selection of the architecture
also.

CHAPTER 6.

HIERARCHICAL DESIGN AND ANALYSIS 


6.1. Introduction 

The complexity of the detectability algorithm, based on the matrix model, is linear 
in the number of data elements, whereas the complexity of the locatability algorithm is 
quadratic in the number of data elements in the system. Even though these complexities 
are less than the complexities of previous algorithms [23], the computation may require a 
large amount of time and memory when the system has a large number of processors pro- 
ducing huge volumes of data. This motivates the development of a hierarchical approach 
to analysis which will reduce the complexity of the algorithms to a polynomial in the log- 
arithm of the number of processors in the system. 

A natural way to build large systems is to first build small units and then to con- 
struct bigger units from the small units in a hierarchical fashion. This principle has been 
followed in the design of most of the existing large multiprocessor systems. That is, a 
small unit is replicated many times with a systematic method of interconnection. For 
example, a two-dimensional mesh connected processor array may be considered as multi- 
ple replications of a linear array with corresponding elements of the copies connected in 
a linear fashion. It has been suggested that in such complex multicomputer structures, 
fault tolerance should also be handled in a hierarchical fashion [1]. However, even when
the error detection, fault location and recovery are performed in a hierarchical fashion,
analysis of these systems has conventionally been done without exploiting the hierarchy 
[23,46,18,51]. 

In this chapter, we develop techniques to analyze fault-tolerant multiprocessor sys- 
tems in a hierarchical fashion. The fault tolerance of the system at different levels of the 
hierarchy is determined separately and the overall fault tolerance capabilities are derived 
from those values. In order to exemplify such an approach, we first describe a type of 
hierarchy one may follow in order to build a large fault-tolerant multiprocessor system. 
Then, we develop an analytic technique that is based on a hierarchical description of the 
system using the matrix model mentioned earlier. 

In the proposed hierarchical design, large fault-tolerant systems are constructed
from smaller units (basic units) of known fault tolerance capabilities. Basic units
(processors as well as checks) are replicated several times at the next level of the hierarchy and
new checks are introduced. This procedure is repeated recursively through various levels 
of hierarchy. The ability to analyze different checks at different levels of hierarchy 
greatly simplifies the overall analysis of large systems, as we shall see in Section 6.3. 
We derive the relationship between the fault detectability (locatability) of the basic unit 
and the fault detectability (locatability) of systems hierarchically derived from the basic 
unit. In order to make the development of the theory simple, we first assume that the
processors in the fault-tolerant systems under consideration produce only one data
element each. However, the techniques developed in Chapter 5 may be used to extend the
design to systems where individual processors produce multiple data elements.

The organization of the chapter is as follows. In Section 6.2 we develop the concept
of independent and orthogonal checks. Section 6.3 deals with the hierarchical design and 
analysis of fault-tolerant systems. Our conclusions are stated in Section 6.4. 

6.2. Independent and Orthogonal Checks 

In this section, we develop certain properties of (g,h) checks which are described in 
Chapter 2. These properties are eventually used in the hierarchical analysis of systems. 

DEFINITION 6.1. The Domain of a set of checks S (denoted as D(S)) is defined as the
set of processors that are checked by these checks.

$D(S) = \{ P_i \mid PC_{ij} \neq \Phi \text{ for all } C_j \in S \}$

where PC represents the PC matrix and $\Phi$ is the null set. □

DEFINITION 6.2. Sets of checks $S_1$ and $S_2$ are said to be independent if and only if

$D(S_1) \cap D(S_2) = \Phi$. □

In Figure 6.1, A and B are the domains of sets of checks $S_1$ and $S_2$, respectively. Since
$A \cap B = \Phi$, $S_1$ and $S_2$ are independent checks.
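
The notions of domain and independence can be sketched directly on a PC matrix. In this sketch we assume that a nonzero entry PC[i][j] means that processor i is checked by check j, and we take the domain of a set of checks to be the set of processors checked by at least one of them, following the prose of Definition 6.1. The function names and the small PC matrix are illustrative only.

```python
def domain(pc, checks):
    """Processors (row indices of the PC matrix) that are checked by at
    least one of the checks (column indices) in `checks`."""
    return {i for i, row in enumerate(pc) if any(row[j] != 0 for j in checks)}


def independent(pc, s1, s2):
    """Definition 6.2: two sets of checks are independent if their
    domains have no processor in common."""
    return domain(pc, s1).isdisjoint(domain(pc, s2))


# Hypothetical PC matrix: 4 processors (rows), 3 checks (columns).
pc = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
]
print(independent(pc, {0}, {1}))  # True: domains {0, 1} and {2, 3}
print(independent(pc, {0}, {2}))  # False: processor 1 is in both domains
```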

DEFINITION 6.3. $det(A)|_S$ is defined as the fault detectability of system A when it is
checked by S. □

DEFINITION 6.4. $loc(A)|_S$ is defined as the fault locatability of system A when it is
checked by S. □

LEMMA 6.1. If $S_1$ and $S_2$ are two independent sets of checks and $A_1$ and $A_2$ are
their respective domains, then

$det(A_1 \cup A_2)|_{S_1 \cup S_2} = \min[det(A_1)|_{S_1}, det(A_2)|_{S_2}]$.

PROOF: Follows from the definition of independence. □

[Figure 6.1. Independent checks.]

LEMMA 6.2. If $S_1$ and $S_2$ have the same domain A, then

$det(A)|_{S_1 \cup S_2} \geq \max[det(A)|_{S_1}, det(A)|_{S_2}]$.

PROOF: Since $S_1$ and $S_2$ are applied on the same set of processors, a fault will be
detected if either $S_1$ or $S_2$ detects it. Therefore, such an arrangement should be able to
detect as many faults as either $S_1$ or $S_2$, whichever is larger. □

Similar results apply to the locatability of systems too. Now we find an upper limit
for $det(A)|_{S_1 \cup S_2}$ and $loc(A)|_{S_1 \cup S_2}$. Intuitively, the upper limit is going to be dependent
on the domains of individual checks in $S_1$ and $S_2$.

DEFINITION 6.5. Sets of checks $S_1$ and $S_2$ are said to be orthogonal to each other if:
(1) any check in $S_1$ has at most one processor in common with any check in $S_2$; (2) for
every check in $S_1$ there is at least one check in $S_2$ which shares a processor with the
check in $S_1$. □

EXAMPLE 6.1. The set of the row checksums and the set of the column checksums
applied to a mesh-connected processor array form two orthogonal sets of checks. □
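
Definition 6.5 and Example 6.1 can be checked mechanically. In the sketch below each check is represented simply as the set of processors it covers; this representation and the function names are our own illustrative choices.

```python
def orthogonal(s1, s2):
    """Definition 6.5 for two sets of checks, each check being given as
    the set of processors it checks."""
    # (1) any check in s1 shares at most one processor with any check in s2
    at_most_one = all(len(c1 & c2) <= 1 for c1 in s1 for c2 in s2)
    # (2) every check in s1 shares a processor with at least one check in s2
    each_meets_some = all(any(c1 & c2 for c2 in s2) for c1 in s1)
    return at_most_one and each_meets_some


def mesh_checks(rows, cols):
    """Row and column checks of a rows x cols mesh; processors are (r, c) pairs."""
    row_checks = [{(r, c) for c in range(cols)} for r in range(rows)]
    col_checks = [{(r, c) for r in range(rows)} for c in range(cols)]
    return row_checks, col_checks


row_checks, col_checks = mesh_checks(3, 4)
print(orthogonal(row_checks, col_checks))  # True
```

Each row check meets each column check in exactly one processor, so both conditions of the definition hold for any mesh size.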

LEMMA 6.3. If $S_1$ and $S_2$ are two sets of checks having the same domain A, then
$det(A)|_{S_1 \cup S_2}$ is maximized when $S_1$ and $S_2$ are orthogonal to each other.

PROOF: Fault detectability of the system will be a maximum if individual checks
in $S_1$ share a minimum number of processors with the checks in $S_2$. However, since it is
necessary that every processor in A is checked by at least one check in $S_1$ and by at least
one check in $S_2$, the minimum number of processors that they can share is one.
Therefore, for the detectability to be a maximum, it is necessary that the checks are orthogonal
to each other. □

THEOREM 6.1. If $det(A)|_{S_1} = t_1$, $det(A)|_{S_2} = t_2$, $loc(A)|_{S_1} = l_1$, and
$loc(A)|_{S_2} = l_2$, where $t_1, t_2 \geq 1$, then

$det(A)|_{S_1 \cup S_2} < (t_1 + 1)(t_2 + 1)$

$loc(A)|_{S_1 \cup S_2} < (l_1 + 1)(l_2 + 1)$.

PROOF: By the definition of detectability, the minimum size of a fault pattern
which cannot be detected by $S_1$ is $(t_1 + 1)$. Similarly, the minimum size of the fault
pattern which cannot be detected by $S_2$ is $(t_2 + 1)$. We want to maximize the size of the
smallest fault pattern which cannot be detected by checks in both $S_1$ and $S_2$. From the
previous lemma, the detectability is maximized when the checks are orthogonal. In this
configuration one can observe that the maximum of the size of the smallest undetectable
fault is equal to the product of the size of the smallest fault pattern undetectable by $S_1$
and the size of the smallest fault pattern undetectable by $S_2$. That is,

$det(A)|_{S_1 \cup S_2} < (t_1 + 1)(t_2 + 1)$.

With a similar logic we arrive at the corresponding result for the locatability of the
system. □

In the following section, we show that this is a reachable bound. We describe the 
construction of a fault-tolerant system which achieves the maximum fault detectability 
and locatability. 

6.3. The Hierarchical Approach 

Our main objective is to show how checks can be treated at different levels of
hierarchy during the analysis of a system. To that end we first describe a hierarchical
approach for the design of fault-tolerant multiprocessor systems. One may seek various
kinds of hierarchies to design a system. However, the particular hierarchy we suggest is
motivated by two factors: (1) most of the existing multiprocessor systems are built using
this type of hierarchy; for example, in the binary hypercube, an n-dimensional cube is
constructed by connecting the corresponding processors in two (n-1)-cubes; (2) this will
maximize the fault detectability and locatability of the overall system for a given fault
detectability and locatability of the basic unit [54].

Before going into the details of the type of hierarchy we use in the design and 
analysis of systems, we will establish some properties related to the fault 
detectability/locatability of fault-tolerant systems and the error detectability of checks 
used in those systems. In this section, we assume that every processor produces only one 
data element and that the fault in a check-evaluating processor will not invalidate the 
checks performed by that processor. The presence of a fault is manifested as a single
error. In other words, there is a one-to-one correspondence between faults and errors.
Therefore, in ensuing discussions, we will use the terms faults and errors interchangeably. 

DEFINITION 6.6. A fault-tolerant multiprocessor system is said to be bounded if and 
only if t < N where t is the fault detectability of the system and N is the number of proces- 
sors in the system. □ 

The concept of boundedness is relevant only if the checking operations do not 
become invalid even if the corresponding check processors fail. This may be achieved 
either by employing an external processor for the checking operation or by building the 
checking units inside the check processors to be totally self-checking. If this condition is
not satisfied, trivially, the fault detectability of a system cannot exceed the number of 
check evaluating processors in the system and hence the system is always bounded. 

EXAMPLE 6.2. Consider a mesh-connected processor array in which processors are
checked by column checks and row checks as shown in Figure 6.2. Let the error
detectability of the column/row checks be h = 2. Using the algorithms given in the
preceding section we find that the array in Figure 6.2 (a) has a fault detectability of 4, and
the one in Figure 6.2 (b) has a fault detectability of 8. Therefore, system (a) is
not bounded whereas system (b) is bounded. □
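
The values quoted in Example 6.2 can be reproduced by exhaustive search on small meshes. The sketch below is not the matrix-based algorithm of the thesis; it is a brute-force illustration under stated assumptions: every processor produces one data element, a check is guaranteed to fail only when it sees between 1 and h errors, and the two arrays are taken to be a 2 x 2 and a 3 x 3 mesh (the mesh sizes are our assumption, since the figure is not reproduced here).

```python
from itertools import combinations


def fault_detectability(checks, processors, h):
    """Largest t such that every fault pattern with 1..t faulty processors
    makes at least one check see between 1 and h errors (and thus fail)."""
    processors = list(processors)
    for size in range(1, len(processors) + 1):
        for pattern in combinations(processors, size):
            faulty = set(pattern)
            if not any(1 <= len(check & faulty) <= h for check in checks):
                # The smallest undetectable pattern has `size` faults.
                return size - 1
    return len(processors)  # every non-empty fault pattern is detected


def mesh(n):
    """Processors and row/column checks of an n x n mesh."""
    procs = [(r, c) for r in range(n) for c in range(n)]
    rows = [{(r, c) for c in range(n)} for r in range(n)]
    cols = [{(r, c) for r in range(n)} for c in range(n)]
    return procs, rows + cols


for n in (2, 3):
    procs, checks = mesh(n)
    t = fault_detectability(checks, procs, h=2)
    print(n, len(procs), t, "bounded" if t < len(procs) else "not bounded")
# 2 4 4 not bounded
# 3 9 8 bounded
```

The search is exponential in the number of processors, so it is useful only as a sanity check on very small systems.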

DEFINITION 6.7. A check is said to be bounded if it checks more than h data ele- 
ments. □ 

LEMMA 6.4. A sufficient condition for a system to be bounded is that all the checks
performed in the system are bounded.

[Figure 6.2. Examples for unbounded and bounded systems: (a) and (b).]

PROOF: Proof by contradiction: Suppose all the checks are bounded and the system
is not bounded, which implies N = t. However, when all the processors are faulty, every
check will be checking more than h errors, and hence all of them will produce invalid
results. Therefore, we cannot detect the simultaneous presence of N faults, which is a
contradiction to the hypothesis. □

It may be noticed that in Example 6.2, the system represented in Figure 6.2 (b) has 
all its checks bounded and hence the system is bounded as we had found out by other 
means. 

LEMMA 6.5. If $s_i$, for i = 1, 2, ..., are subsystems of a given system S such that $s_i \subseteq S$,
then S is bounded if and only if at least one subsystem $s_j$ is bounded; then the fault
detectability/locatability of S is less than or equal to the fault detectability/locatability of
$s_j$.

PROOF: The proof for the necessary condition is trivial, since by definition of $s_i$, it
could be the system S itself.

Proof of the sufficiency condition: Let $s_j$ be a bounded subsystem. Then, irrespective
of the rest of the system, the subsystem $s_j$ will have a detectability $t < |s_j|$, where
$|s_j|$ denotes the number of processors in the subsystem. However, if the rest of the
system has a detectability strictly less than t, then the overall detectability (T) of the system
will be also strictly less than t. Therefore, $T \leq t < |S|$ and hence the proof. □

The motivation for Lemma 6.5 is to point out that if we add more processors to an 
already bounded system (note that there are no new checks added to the existing system 
during the expansion), the fault detectability of the overall system does not increase. We 
make use of this inference in the formulation of Theorem 6.2. 

6.3.1. Construction of a hierarchical system 

We now outline the procedure to build a hierarchical system from a basic unit. Let
B be the given basic system with a known fault detectability and locatability. First we
make copies of B (replication involves replication of the checks also). Let $P_1, P_2, \ldots, P_r$
be the processors in B which are checked only by bounded checks. As a second step
in the hierarchical expansion of the system, r new checks are introduced such that the
first check performs the evaluation of processor $P_1$ in B and all its image processors in
other copies of B, the second check evaluates processor $P_2$ and all its copies, and so on.
The procedure is illustrated in Figure 6.3. We do not have to provide a check at the
higher levels for those processors which are checked by at least one unbounded check
because an unbounded check will always fail if the processor(s) checked by that check is
(are) faulty regardless of the presence of any other fault.

[Figure 6.3. Hierarchical expansion of a basic system. Legend: CI = internal check; CH = check in the higher level.]

In the figure, $B_1, B_2, \ldots, B_{k-1}$ are copies of the basic system B. CI represents the set
of checks which are internal to every copy of B and CH represents the set of checks in the
next higher level. In the following discussion, this kind of a construction will be referred
to as k-fold expansion of B in the next level of hierarchy. Following a similar procedure,
the expanded system may further be extended in the next higher level of the hierarchy.

It may be noted that the checks introduced at different levels of hierarchy may be of 
different types having different values of error detectability. In order to simplify the 
development of our theory, we assume that all the checks in the system are similar, and 
have the same error detectability. However, the value of g may be different. An impor- 
tant restriction we impose while expanding the system is that there is no data migration 
allowed between processors in different copies of the basic unit. In other words, faults in 
one copy of the basic unit will not affect other copies.

Now we derive the relationships between the fault detectability (locatability) of a
basic unit and a system obtained by hierarchically expanding the basic unit. 

THEOREM 6.2. If t and l are the fault detectability and locatability of a bounded
basic system B, and S is a k-fold ($k \leq g$) d-level hierarchical expansion of B, then the fault
detectability ($T_d$) and locatability ($L_d$) of S are

$T_d = |B| \cdot k^{d-1}$              for $1 < k \leq h$
$T_d = (t+1)(h+1)^{d-1} - 1$     for $k > h$

$L_d = 2^{d-1}(l+1) - 1$          for $k > 1$

where $|B|$ represents the number of processors in the basic system.
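
The closed-form expressions of Theorem 6.2 are easy to evaluate for given parameters. The sketch below simply encodes the formulas as stated above; the function and parameter names are ours, and returning t and l unchanged for a single level (d = 1) is our convention for the unexpanded basic system.

```python
def hierarchical_detectability(size_b, t, h, k, d):
    """T_d for a k-fold, d-level expansion of a basic system with size_b
    processors, fault detectability t and checks of error detectability h."""
    if d < 1 or k <= 1:
        raise ValueError("need d >= 1 and k > 1")
    if d == 1:
        return t  # no expansion: the basic system itself
    if k <= h:
        return size_b * k ** (d - 1)           # case 1 < k <= h
    return (t + 1) * (h + 1) ** (d - 1) - 1    # case k > h


def hierarchical_locatability(l, d):
    """L_d for a d-level expansion (k > 1) of a basic system with locatability l."""
    if d < 1:
        raise ValueError("need d >= 1")
    return 2 ** (d - 1) * (l + 1) - 1


# A 3-processor basic unit with t = 2 and h = 2, expanded 3-fold to 2 levels.
print(hierarchical_detectability(size_b=3, t=2, h=2, k=3, d=2))  # 8
print(hierarchical_locatability(l=1, d=3))                       # 7
```

The first call corresponds to a k > h expansion and gives (2+1)(2+1) - 1 = 8, the same value as the bounded mesh in Example 6.2.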

PROOF: We prove the theorem by induction on the number of levels, d.

Proof for the detectability part:

Case 1. $1 < k \leq h$

Basis: d = 2

Consider the lowest level and the next higher level of hierarchy (i.e., level 2). When
$k \leq h$, none of the checks in level 2 will become invalid for any combination of faults in B
and the copies of B, since none of these checks are evaluating more than h data elements.
On the other hand, at least one of these checks will fail for any combination of faults in
the system. Therefore, the fault detectability is equal to the number of processors in the
system, which is equal to $|B| \cdot k$.

Inductive Step:

Let the hypothesis be true for the number of levels up to d - 1. Then

$T_{d-1} = |B| \cdot k^{d-2}$.

Now considering $S_{d-1}$ as the basic unit and applying the basis case, we arrive at

$T_d = |S_{d-1}| \cdot k = (|B| \cdot k^{d-2}) \cdot k = |B| \cdot k^{d-1}$.

Note that in this case the resulting system is not bounded.

Case 2. $k > h$ 

Basis: $d = 2$ 

Here, also, we first consider the basic unit B and its next level of hierarchy. We now 
prove that there exists at least one fault pattern of cardinality $(t+1)(h+1)$ which will not be 
detected in the 2-level system. Let there be identical fault patterns of cardinality $(t+1)$ 
occurring in $(h+1)$ copies of B. These faults will not be detected by checks inside the 
copies of B, since the fault detectability of B is equal to t. Every check at the second level 
is either checking processors which are not faulty or $(h+1)$ processors which are faulty. 
In either case, the checks may produce a "pass" output (since the error detectability of the 
checks is equal to h, faults of cardinality $(h+1)$ may invalidate the checks). This means 
that the faults will not be detected in this level, also. 

Now we prove that every fault pattern of cardinality less than or equal to 
$(t+1)(h+1) - 1$ will be detected. In order for such a fault pattern not to be detectable in the 
lower level, it is necessary that the copies of B having faulty processors should have more 
than t faults present in them. The fault is not detectable in the higher level only if the 
checks evaluating the faulty processors check more than h errors. If we try to distribute 
$(t+1)(h+1) - 1$ faults into $(h+1)$ copies such that every copy has at least $(t+1)$ faults, 
by the pigeonhole principle [55], at least one of the copies will have $\le t$ faults, and 
hence the fault is detectable in that copy. Conversely, if every faulty copy has $\ge (t+1)$ 
faults, then at most h copies are faulty, so at least one check in the next level will be 
checking $\le h$ faults, and will detect those faults. 

Therefore, the fault detectability of a 2-level expansion of B is $(t+1)(h+1) - 1$. 

Inductive Step: 

Let the hypothesis be true for the number of levels up to $d - 1$. Then 

$$T_{d-1} = (t+1)(h+1)^{d-2} - 1.$$

Now applying the basis case, we have 

$$T_d = (T_{d-1} + 1)(h+1) - 1 = (t+1)(h+1)^{d-1} - 1.$$

Proof for the locatability part: 

Basis: d=2 

Here, also, we first consider the lowest level of hierarchy and the next level. First, we 
prove that there exists a fault pattern of cardinality $(2l+2)$ which will not be correctly 
located in a 2-level system. Consider two identical fault patterns of size $(l+1)$ occurring 
on two copies of B. Since the locatability of B is l, these fault patterns will not be located 
correctly within the copies of B (that is, the internal checks cannot locate the faults 
correctly). Even if all the checks in the next level detect faults, it may not be possible to 
locate the faults among the copies of B. 

Now we prove that any fault pattern of cardinality $\le 2l+1$ will always be correctly 
located in a 2-level system. Consider a fault pattern of size $(2l+1)$. If we distribute the 
individual faults in this pattern among various copies of B, by the pigeonhole principle, 
at most one copy of B will have $\ge (l+1)$ faults. Let us denote such a copy as $B_i$. Now the 
total number of individual faults distributed among all the other copies will be $\le l$. 
Therefore, any of those copies will have a fault pattern of cardinality $\le l$ and will be 
correctly located by checks within the copy. Now let us consider locating the fault 
within $B_i$. The maximum size of a fault which may occur in $B_i$ is $2l+1$, in which case 
none of the other copies will have any fault in them. Since $t \ge (2l+1)$ [18], the fault in $B_i$ 
will be detected by checks within $B_i$. The checks in the next level are checking only one 
fault each, and therefore, they can locate the faults within $B_i$. Hence the fault is locatable. 

Now we consider a more general case in which the number of faults in $B_i$ is $(l+r)$ 
where $r \ge 1$. The rest of the copies of B will have a total of $(l-r+1)$ faults. Regardless of 
the way these faults are distributed among these copies, there will be at least $(2r-1)$ 
faults in $B_i$ such that there are no faults in any other copy of B in the 
corresponding positions (we refer to these faults as unobscured faults). These $(2r-1)$ 
faults in $B_i$ can be located by checks in the higher level. The remaining $(l-r+1)$ faults 
can be uniquely located with the help of the syndrome generated by the internal checks 
of $B_i$ since $(l-r+1) \le l$. Therefore, any fault pattern of cardinality $\le (2l+1)$ can be 
uniquely located in a 2-level system. 

Inductive Step: 

Let the hypothesis be true for the number of levels up to $d - 1$. Then 

$$L_{d-1} = 2^{d-2}(l+1) - 1.$$

Now applying the basis case, we arrive at 

$$L_d = 2L_{d-1} + 1 = 2^{d-1}(l+1) - 1.$$

It may be noted that the fault detectability of a system increases as the number of 
copies (k) in the same level increases, until k reaches a value equal to h. For k > h, the 
system is bounded in that level and the fault detectability attains a constant value as 
described in the beginning of this section. Locatability, however, is independent of the 
value of k. This is because of the generality of the definition of checks where we assume 
that an individual check can locate no faults, even though it can detect multiple faults. 
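
The inductive steps in the proof of Theorem 6.2 can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and it only unrolls the two recurrences for the case k > h, starting from $T_1 = t$ and $L_1 = l$, and compares the results with the closed forms of the theorem.

```python
def detectability_by_induction(t: int, h: int, d: int) -> int:
    """Unroll T_d = (T_{d-1} + 1)(h + 1) - 1 from T_1 = t (case k > h)."""
    T = t
    for _ in range(d - 1):
        T = (T + 1) * (h + 1) - 1
    return T


def locatability_by_induction(l: int, d: int) -> int:
    """Unroll L_d = 2 L_{d-1} + 1 from L_1 = l."""
    L = l
    for _ in range(d - 1):
        L = 2 * L + 1
    return L


# Compare the unrolled recurrences with the closed forms of Theorem 6.2
# for a few illustrative (hypothetical) parameter values.
for t, l, h in [(1, 0, 1), (3, 1, 1), (5, 2, 3)]:
    for d in range(1, 7):
        assert detectability_by_induction(t, h, d) == (t + 1) * (h + 1) ** (d - 1) - 1
        assert locatability_by_induction(l, d) == 2 ** (d - 1) * (l + 1) - 1
```

Each added level multiplies $(T + 1)$ by $(h + 1)$ and doubles $(L + 1)$, which is why both closed forms are geometric in d.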

In the following, we present two examples to illustrate the hierarchical construction 
of fault-tolerant systems. 

EXAMPLE 6.3. As a first example, we consider a linear processor array as shown in 
Figure 6.4 (a). In the array, all the processors are evaluated by a check with error 
detectability $h = 1$. Fault detectability of such a linear array is equal to 1 (i.e., $t = 1$) and 
fault locatability is equal to 0 (i.e., $l = 0$). Now we expand the system hierarchically to form 
a two dimensional mesh connected processor array as shown in Figure 6.4 (b). The newly 
added checks in the new level are shown by dotted lines. 

By the previous theorem, the fault detectability of the mesh connected processor 
array is 

$$T_2 = (t+1)(h+1)^{d-1} - 1 = 2 \times 2 - 1 = 3.$$

The fault locatability is 

$$L_2 = 2^{d-1}(l+1) - 1 = 2 \times (0+1) - 1 = 1.$$

These values conform to the values obtained from the analysis using the nonhierarchical 
algorithms presented in Section 6.2. 

Figure 6.4. Hierarchical expansion of a linear array. 


EXAMPLE 6.4. As a second example, we consider the hierarchical expansion of the 
Advanced Onboard Signal Processor (AOSP) architecture [48]. From the analysis of this 
system we find that the system is 3-fault detectable and single fault locatable. 

Now let us consider a 4-fold, 2-level expansion of AOSP (i.e., using AOSP as the 
basic system). The expansion scheme is illustrated in Figure 6.5. Every copy of AOSP 
is associated with six internal checks. Nine additional checks are added in the next level 
as shown in the figure by dotted lines. By Theorem 6.2, the fault detectability and 
locatability of this system are 7 and 3, respectively. 
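
The two examples can be reproduced with a direct evaluation of Theorem 6.2. This is a sketch under stated assumptions: the function name and argument names are hypothetical; the value h = 1 for the AOSP checks is not stated in the text and is inferred here from the reported detectability of 7; the processor counts and the value k = 4 for the mesh are illustrative only and do not affect the k > h case.

```python
def hierarchical_capability(t, l, h, k, d, n_basic):
    """Return (T_d, L_d) for a k-fold, d-level expansion, per Theorem 6.2.

    t, l    : fault detectability and locatability of the basic system
    h       : error detectability of each check
    n_basic : number of processors in the basic system, |B|
    """
    if k <= 1 or d < 1:
        raise ValueError("Theorem 6.2 is stated for k > 1 and d >= 1")
    if k <= h:
        # No check evaluates more than h data elements.
        T_d = n_basic * k ** (d - 1)
    else:
        T_d = (t + 1) * (h + 1) ** (d - 1) - 1
    L_d = 2 ** (d - 1) * (l + 1) - 1
    return T_d, L_d


# Example 6.3: linear array with t = 1, l = 0, h = 1, expanded to a mesh.
# k = 4 and n_basic = 4 are illustrative assumptions.
mesh = hierarchical_capability(t=1, l=0, h=1, k=4, d=2, n_basic=4)

# Example 6.4: AOSP with t = 3, l = 1, 4-fold, 2-level expansion.
# h = 1 and n_basic = 9 are inferred, not given in the text.
aosp = hierarchical_capability(t=3, l=1, h=1, k=4, d=2, n_basic=9)

print(mesh, aosp)
```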


Figure 6.5. Hierarchical expansion of AOSP architecture. 

6.3.2. The number of checks in the hierarchical system 

In this section we compute the number of checks required in the hierarchical 
construction of a system in terms of the number of checks in the basic unit and the number of 
levels in the system. Here we assume that all the checks in the system (including the 
checks in CI) are bounded. 

THEOREM 6.3. The number of checks used in a hierarchical system built by a 
k-fold, d-level expansion of a basic unit is 

$$H_d = r k^{d-1} + n(d-1)k^{d-2},$$

where r is the number of checks in the basic unit and n is the number of processors in the 
basic unit. 

PROOF: By construction, the number of checks in a hierarchical system satisfies 
the recursive equation 

$$H_d = k H_{d-1} + N_{d-1},$$

where $N_{d-1}$ is the number of processors in the $(d-1)$-level system. Since $N_d = n k^{d-1}$, 

$$H_d = k H_{d-1} + n k^{d-2}.$$

Solving the recursive equation with boundary conditions $H_1 = r$ and $H_2 = rk + n$ yields 

$$H_d = r k^{d-1} + n(d-1)k^{d-2}. \qquad \Box$$
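
The recursion and its closed-form solution can be compared directly. The sketch below uses hypothetical function names. For the AOSP numbers, r = 6 and k = 4 come from Example 6.4, while n = 9 is an inference from the nine checks added at the second level (since $H_2 = rk + n$), not a value stated in the text.

```python
def checks_recursive(r: int, n: int, k: int, d: int) -> int:
    """Evaluate H_d = k * H_{d-1} + n * k**(d-2), starting from H_1 = r."""
    H = r
    for level in range(2, d + 1):
        H = k * H + n * k ** (level - 2)
    return H


def checks_closed_form(r: int, n: int, k: int, d: int) -> int:
    """Evaluate H_d = r * k**(d-1) + n * (d-1) * k**(d-2)."""
    if d == 1:
        return r  # avoids a negative exponent; the second term vanishes
    return r * k ** (d - 1) + n * (d - 1) * k ** (d - 2)


# The two forms agree for a range of illustrative parameters.
for d in range(1, 8):
    assert checks_recursive(6, 9, 4, d) == checks_closed_form(6, 9, 4, d)

# 4-fold, 2-level expansion with six internal checks per copy and
# (assumed) nine processors per copy.
print(checks_closed_form(6, 9, 4, 2))
```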

However, we observe that it is not necessary to have all the $H_d$ checks. The 
observation is elaborated in the following lemmas. 

LEMMA 6.6. There exists a 2-level system with detectability $T_2$ which requires only 
$H_2 - T_1$ checks. 

PROOF: We shall prove that even if we remove any $T_1$ (note that $T_1 = t$) checks 
from the set of checks in the second level, the detectability of a 2-level system remains 
the same as $T_2$. Any detectable fault in the system should be detected either at the lower 
level (by the CI's) or in the higher level. If the fault is detected in the lower level, removal of 
checks from the higher level is not going to affect the detectability of the fault. Therefore, 
we need to consider faults which are detectable only in the higher level. If such a 
fault occurs, some copies of the basic unit will have $\ge (t+1)$ faults whereas the 
rest of the copies will not have any faults at all. However, from Theorem 6.2 we know 
that there are at most h copies of the basic unit having $\ge (t+1)$ faults. In Figure 6.6, 
the large ellipses represent copies of the basic unit which have $\ge (t+1)$ faults. 
The shaded portions represent $T_1$ processors in every basic unit which are not checked in 
the next higher level. The corresponding checks which are removed from the system are 
denoted as set U. Since the size of the fault patterns present in the copies is $\ge (t+1)$, at 
least one faulty processor in that copy will be checked by a check $C_h$ in the set (CH - U). 
Since the number of such faults checked by every check in (CH - U) is at most h, the 
fault will be detected. Thus, the detectability is unaffected despite the removal of $T_1$ 
checks from the original hierarchical system. □ 

A similar result exists related to the locatability of a system. However, a distinction 
has to be made between the problems of detectability and locatability here: in the case of 
detectability, we need help from the higher-level checks only when the fault is undetect- 
able in all the copies of the basic unit whereas, in the case of locatability we need to use 
the higher-level checks whenever at least one copy of the basic unit has a fault pattern 
that is unlocatable with the help of the internal checks. Intuitively, we cannot remove as 
many checks in the case of locatability as in the case of detectability and still preserve 
the overall locatability of the system. 

LEMMA 6.7. There exists a 2-level system with locatability $L_2$ which requires only 
$H_2 - 1$ checks. 

PROOF: From Theorem 6.2, we know that there is at most one copy of the basic 
unit which has $\ge (l+1)$ faults, and we know that the purpose of the higher 
level checks is to locate correctly the unobscured faults. Therefore, we must ensure that 
the unobscured faults do not lie entirely inside the shaded region (that is, the set of 
processors which had been checked by the checks in U). Since the minimum size of the 
set of unobscured faults is one, the maximum number of checks we can remove from CH, 
without altering the locatability of the system, is one. □ 

Figure 6.6. Unnecessary checks in the second level of hierarchy. 

Now we generalize these results for a d-level system. Even though the saving in 
terms of the number of checks is small for a 2-level system, it will be shown that the 
overall saving may be significant for larger values of d. 

THEOREM 6.4. There exists a k-fold, d-level system with detectability $T_d$ using 
$H_d - {}^{det}U_d$ number of checks, where 

$${}^{det}U_d = \frac{1}{(k-1)} \left[ \left( (t+1) \sum_{i=1}^{d-1} (k^{d-i} - 1)(h+1)^{i-1} \right) - \sum_{i=1}^{d-1} (k^{d-i} - 1) \right].$$

PROOF: From Lemma 6.6, we know that in a 2-level system, $T_1$ number of checks 
are unnecessary. In the hierarchical expansion, the second level systems will be 
considered as the new basic units and are replicated in the third level. Here, the overall 
saving will be $T_1 k + T_2$. If we recursively calculate the number of checks saved, we arrive 
at 

$${}^{det}U_d = \sum_{i=1}^{d-1} T_i \left( \sum_{j=0}^{d-i-1} k^j \right).$$

Now substituting the value for $T_i$ as $(t+1)(h+1)^{i-1} - 1$, and using the geometric sum 
$\sum_{j=0}^{d-i-1} k^j = (k^{d-i} - 1)/(k-1)$, we have 

$${}^{det}U_d = \frac{1}{(k-1)} \left[ \left( (t+1) \sum_{i=1}^{d-1} (k^{d-i} - 1)(h+1)^{i-1} \right) - \sum_{i=1}^{d-1} (k^{d-i} - 1) \right]. \qquad \Box$$


THEOREM 6.5. There exists a k-fold, d-level system with detectability $T_d$ and 
locatability $L_d$ using $H_d - {}^{loc}U_d$ number of checks, where 

$${}^{loc}U_d = \sum_{i=1}^{d-1} \left( \sum_{j=0}^{i-2} k^j \right).$$

PROOF: The proof is very similar to the previous theorem except that in every 
level we save only one check during expansion. In the second level we save one check, 
in the third level $(k+1)$, and so on. In general the number of checks saved in level i is 
equal to $\sum_{j=0}^{i-2} k^j$. Summing all those values up to level d, 

$${}^{loc}U_d = \sum_{i=1}^{d-1} \left( \sum_{j=0}^{i-2} k^j \right). \qquad \Box$$


6.3.3. Hierarchical analysis of systems 

The hierarchical principles derived in the preceding section can be translated into 
the domain of the fundamental matrices which constitute the matrix model. Replication 
of the basic unit is equivalent to a repetition of the PC matrix of the basic unit along the 
diagonal. The addition of checks in the new dimension is tantamount to adding identity 
matrices (one identity matrix per diagonal submatrix) to the expanded PC matrix. The 
matrix equivalent of the hierarchical expansion of a system is shown in Figure 6.7. Here 
PC represents the PC matrix of the basic unit and I is an identity matrix. 
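
The matrix construction described above can be sketched as follows. This is an illustration under assumptions, since Figure 6.7 is not reproduced here: rows are taken to be processors and columns to be checks, the new-level checks are appended as extra columns after the replicated blocks, and the function name `expand_pc` is hypothetical.

```python
def expand_pc(pc, k):
    """Build the PC matrix of a k-fold, one-level expansion of a basic unit.

    pc : list of n rows (processors), each a list of r 0/1 entries (checks).
    Returns an (n*k) x (r*k + n) matrix: k copies of pc along the diagonal,
    followed by one n x n identity block per copy for the new-level checks.
    """
    n = len(pc)
    r = len(pc[0])
    expanded = []
    for copy in range(k):
        for p in range(n):
            row = [0] * (r * k + n)
            # Internal checks of this copy occupy their own diagonal block.
            row[copy * r:(copy + 1) * r] = pc[p]
            # Identity block: new check p covers processor p of every copy.
            row[r * k + p] = 1
            expanded.append(row)
    return expanded


# Linear array of three processors evaluated by a single check,
# expanded 2-fold into a 2 x 3 mesh (in the spirit of Example 6.3).
linear_array = [[1], [1], [1]]
for row in expand_pc(linear_array, 2):
    print(row)
```

Under these assumptions the number of columns is rk + n, which agrees with the boundary condition $H_2 = rk + n$ used in Theorem 6.3.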

In order to analyze a given system hierarchically, we first arrange the rows and 
columns of the PC matrix in such a way that the final matrix is in the form shown in Fig- 
ure 6.7. Now, the detectability and locatability of the basic PC matrix can be computed, 
from which the detectability and locatability of the entire system can be derived using the 
results in Theorem 6.2. Note that, typically, the size of the basic unit is considerably 
smaller than the size of the entire system. 

Figure 6.7. The PC matrix of a hierarchical system. 

However, in certain designs, it may be the case that the diagonal PC matrices will 
have the same number of rows, but a different number of columns, that is, during 
replication of the basic unit, not all the internal checks (CI) were replicated. In this case 
the fault detectability and the locatability of the images of the basic unit may be different. 
In such a case we cannot compute the actual values of fault detectability and locatability 
of the hierarchical system. However, we can calculate a lower bound on these figures. 

COROLLARY 6.1. (Of Theorem 6.2.) If the diagonal PC matrices have different 
detectabilities and locatabilities, then the detectability and locatability of the hierarchical 
system are bounded by 

$$T_d \ge (t_{min} + 1)(h+1)^{d-1} - 1;$$

$$L_d \ge 2^{d-1}(l_{min} + 1) - 1;$$

where $t_{min}$ is the minimum value of detectability and $l_{min}$ is the minimum value of 
locatability among the copies of basic units. 

6.4. Conclusions 

We developed the concept of independent and orthogonal checks depending upon 
the set of processors checked by the given sets of checks. Using orthogonal checks, 
hierarchical techniques for the design and analysis of large fault-tolerant multiprocessor 
systems were developed. We introduced the method to model different levels of checks, 
which greatly simplifies the analysis and design of systems. The relationships between 
the fault-diagnosing capabilities of basic systems and their hierarchical expansions were 
derived. 




CHAPTER 7. 
CONCLUSIONS 


7.1. Summary of Results 

In many critical applications of VLSI-based computer systems, it is important to 
have high performance as well as high reliability. High reliability has been achieved by 
the application of fault tolerance techniques. Since the fault tolerance techniques are 
dependent on the redundancy involved in the computations, such systems are either 
costly due to the hardware redundancy involved or they are unable to reach high perfor- 
mance levels due to the time redundancy. Therefore, the problem in hand is to investi- 
gate techniques by which a high degree of fault tolerance can be achieved without 
sacrificing too much performance. Algorithm-based fault tolerance (ABFT) has been 
proposed as a cost effective scheme to achieve fault tolerance in multiprocessor systems. 
These schemes use functional as well as system-level concurrent error detection for the 
fault diagnosis in a system. 

This thesis has addressed the problem of modeling fault-tolerant systems using con- 
current error detection schemes in general and those using ABFT schemes in particular. 
The major results in the thesis are recapitulated in the following. 

In Chapter 2, we have given a general description of multiprocessor systems which 
have been selected for the application of ABFT systems. In order to exemplify the 
technique, we illustrated how fault-tolerant matrix multiplication can be performed on a 
mesh-connected processor array using checksum encoding techniques. Since most of the 
signal processing computations can be represented as matrix operations, it is desirable to 
have generalized encoding schemes for fault-tolerant matrix operations. In this chapter, 
we developed a general set of real-number codes for these computations. We proved that 
for every linear finite-field code, there exists a real-number code having similar error 
diagnosing capabilities as the finite-field code. Since most of the codes known until now 
fall in the set of finite-field codes, our new result has a far-reaching effect in the area of 
coding theory as it forms a bridge between finite-field codes and real-number codes. 

A matrix-based model for ABFT systems is presented in Chapter 3. The model 
consists of three matrices: the PD (processor-data), the DC (data-check), and the PC 
(processor-check) matrix. The model used a broad interpretation of faults, errors, and 
checks. The problem of invalidation of a check, performed by a faulty processor, is 
efficiently handled by translating it into the problem of error detection at the output of the 
faulty processor. Based on the model, various necessary and sufficient conditions for the 
fault detectability and locatability of systems are derived. Using these constraints and 
sufficient conditions, algorithms were developed for the analysis of ABFT systems. 
These algorithms are much less complex than the previously available algorithms. A 
detailed discussion of the algorithms is given in Chapter 4. 

Chapter 5 dealt with design of ABFT systems. We developed a systematic and 
straightforward methodology for the design of ABFT systems. The design requires a 
smaller number of checks when compared to the previous bounds, especially when the 
individual processors in the system are computing large volumes of data. Other 
advantages include the flexibility of the algorithm to accommodate varying amounts of 
computation performed by the computing nodes and the ability to handle detectability 
and locatability of the system simultaneously. The application of the matrix model 
helped in identifying the redundant checks using simple matrix operations. 

In Chapter 6, we introduced a hierarchical approach for the analysis of fault-tolerant 
multiprocessor systems. Even though inclusion of checks at different levels of hierarchy 
has been practised in the past, the analysis of such systems was carried out on the basis of 
a nonhierarchical ("flat") description of the system. We proposed a hierarchical approach 
for the analysis of these systems. We treat the checks at different levels of hierarchy. 
The fault tolerance of the system at different levels is estimated separately and the 
overall fault tolerance is derived from those values. In order to illustrate the concept, we 
introduced a special type of hierarchy for the design of multiprocessor systems. This par- 
ticular type was chosen since it is easily applicable to most of the commercially available 
multiprocessors. In addition, we observed that this particular type of hierarchy maxim- 
izes the fault detectability and locatability of the overall system for a given error- 
detecting capability of the individual checks. 

7.2. Suggestions for Future Research 

Even though ABFT techniques have been applied to most of the signal processing 
computations, the applicability of the technique in other computations and data manipu- 
lations has to be further investigated. As mentioned in Chapter 2, the ABFT techniques 
are application specific. However, it may be possible to identify groups of computations 
which can use similar encoding schemes to make the computation fault-tolerant. For 
instance, we have shown that there exists a general set of real-number codes applicable to 
various matrix operations such as multiplication, addition, transposition, and LU- 
decomposition. Similar generalization of codes for various other computations is desir- 
able. 

In the graph model as well as in the matrix model, it is assumed that all the data 
values checked by a check processor are available simultaneously at the input of the 
check processor. In fact, this is a general assumption made by researchers in coding 
theory. However, in system level diagnosis the availability of a particular data element 
at the input of a checking processor is dependent on: (1) the computing speed of the 
particular node which computes that data element; (2) the speed and bandwidth of the 
communication channel between the computing node and the evaluation node; (3) the data 
traffic in the system. Therefore, it is desirable to include some timing features into the 
check evaluation process. In [56] the researchers use Petri Nets to study the timing 
behavior of fault-tolerant systems. The limitation of this approach was that even for sys- 
tems having a small number of processors, it takes a large amount of time to verify the 
fault tolerance capabilities of the system. It will be interesting to study the possible 
extension of the matrix-based model to include time-dependent checks. We believe that 
with such a formulation, a faster evaluation of fault-tolerant systems will be possible. 

Another suggestion is to extend the field of application of the matrix-based model. 
The advantage of the proposed model is that it is independent of the particular computa- 
tional algorithm associated with the ABFT system. In order to model a system we need 
to know only the relationship between the various entities in the system. It is not difficult 
to model a fault-tolerant software system using this model. The difference between 
hardware and software modeling is that instead of each processor module in the hardware 
case, we will have a program module in the software system. In fact, the use of the 
model in the analysis and design of fault-tolerant software systems will be even more 
effective since the interaction between various nodes in the system is not limited by the 
physical interconnection between them. 

The hierarchical approach developed in Chapter 6 deals with only one type of 
hierarchy. Even though this covers many of the commercially available fault-tolerant 
array processors, a generalization of the concept of hierarchy is desirable. In the pro- 
posed hierarchy, the checks at different levels are assumed to be orthogonal to each other. 
A general case may be derived by assuming less stringent relationships between the 
checks. 

As mentioned in Chapter 5, there have been two approaches followed by ABFT sys- 
tem designers: (1) given a non-fault-tolerant system, determine an efficient distribution 
of checks among the output data elements so that the system has the desired amount of 
fault tolerance; (2) given a fault-tolerant algorithm, synthesize an architecture so as to 
maximize quantities such as the fault detectability and locatability of the system. In the 
first approach, the fault-tolerant design is constrained by the fixed, non-fault-tolerant 
architecture. Often, this may result in an inefficient design (as far as fault tolerance is 
concerned); however, it preserves the high performance of the original architecture. In 
the second approach, performance may be sacrificed in the process of achieving high 
fault tolerance. Therefore, a more efficient approach would be to synthesize fault- 
tolerant architectures directly from the original algorithms so that the architecture is 
optimal with respect to performance, diagnosability, and reconfigurability. 